Recipe Rating Analysis: Nutritional Values and Ingredients
1 Data Description and Preparation
1.1 Loading data
Here we load the dataset and do some basic cleaning and processing. We standardise the variable names and add an ID column so that each recipe has a unique identifier.
# packages needed for this step
library(here)
library(janitor)
library(tidyverse)
# loading the data
recipes_raw <- read.csv(here("data/epi_r.csv"))
recipes <- recipes_raw %>%
clean_names() %>%
mutate(ID = 1:nrow(.)) %>%
select(ID, everything())
1.2 Data Description
tibble(Variables = c("**ID**", "**title**", "**rating**", "**calories**", "**protein**", "**fat**", "**sodium**", "**674 other binary variables**"), Meaning = c("Unique ID", "Recipe name", "Rating of the recipe", "Calories contained in the recipe", "Protein contained in the recipe (grams)","Fat contained in the recipe (grams)", "Sodium contained in the recipe (milligrams)", "The rest of the data is made of many binary variables, incl. ingredients, types of recipes, US States, diet preferences, etc."), Variable_Type = c("Integer", "Character", "Categorical", "Numerical", "Numerical", "Numerical", "Numerical", "Binary"))%>%
kbl()%>%
kable_styling(position = "center")
| Variables | Meaning | Variable_Type |
|---|---|---|
| ID | Unique ID | Integer |
| title | Recipe name | Character |
| rating | Rating of the recipe | Categorical |
| calories | Calories contained in the recipe | Numerical |
| protein | Protein contained in the recipe (grams) | Numerical |
| fat | Fat contained in the recipe (grams) | Numerical |
| sodium | Sodium contained in the recipe (milligrams) | Numerical |
| 674 other binary variables | The rest of the data is made of many binary variables, incl. ingredients, types of recipes, US States, diet preferences, etc. | Binary |
1.3 Example of the data
selected_columns <- names(recipes)[recipes[1, ] == 1]
additional_variables <- c("title", "rating", "calories", "protein", "fat", "sodium")
selected_columns <- unique(c(additional_variables, selected_columns))
recipes[, selected_columns] %>%
filter(ID==1)
#> title rating calories protein fat sodium
#> 1 Lentil, Apple, and Turkey Wrap 2.5 426 30 7 559
#> ID apple bean cookie fruit kid_friendly lentil lettuce sandwich
#> 1 1 1 1 1 1 1 1 1 1
#> tomato vegetable turkey
#> 1 1 1 1
Let us focus on the first recipe as an example and take a look at the different variables of interest. We can observe the title as well as the numerical variables: rating, calories, protein, fat and sodium. Among the many binary variables, we then see those that are set to 1 for this recipe, such as apple, bean, cookie, fruit, kid_friendly and so on.
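The column selection behind this example relies on a logical index built from a single row. A minimal base R sketch on a toy data frame (the data and the names `toy`, `is_one` and `flagged` are invented for illustration) shows the idea:

```r
# Toy data: two recipes, three binary tag columns (values are invented)
toy <- data.frame(
  ID     = c(1, 2),
  title  = c("Wrap", "Soup"),
  apple  = c(1, 0),
  bean   = c(1, 1),
  cookie = c(0, 1),
  stringsAsFactors = FALSE
)

# Compare every cell of the first row with 1; the result is a one-row logical matrix
is_one <- toy[1, ] == 1

# which() ignores NA comparisons, so columns with missing values are skipped
flagged <- names(toy)[which(is_one)]
flagged
```

Note that `ID` is picked up here only because the first row's ID happens to equal 1. For any other row, the ID column would probably have to be added to the selection explicitly.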
1.4 Classifying variables into categories
Given the large number of variables (680, not counting the ID column), we decided to group them into categories so that they are easier to work with.
Specifically, here are the things we had to solve when creating categories:
- Merge duplicated spellings: “father_s_day” and “fathers_day”, and likewise for Mother’s Day, New Year’s, St. Patrick’s Day and Valentine’s Day, as well as “vermout” and “vermouth”
- Check what “leafy_green” covers and how many observations it has
- Same for “legume”
- Decide whether to group beans together or leave them in vegetables
- Decide what to do with “meat”, “meatball” and “meatloaf”, and with “rabbit”, “sausage”, “steak” and “venison”
- Put “nutmeg” in spices or in nuts?
- Do we create a seeds category, for example for “poppy” (put in spices for now), “seed” and “sesame”?
- We should probably create a sauce category, and a “full_meal” category
- Where do we put “tapioca” and “yuca”?
- Do we put “buttermilk” in drinks or in dairy?
- Check whether a “dorie_greenspa” column exists or whether it was just a typo for “dorie_greenspan”
- Check how many observations have “phyllo_puff_pastry_dough”
- Do we separate fish and seafood?
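The source does not show how the duplicated spellings were merged. One possible way, sketched here in base R on invented toy data, is to flag a recipe whenever either spelling is set and then drop the redundant column:

```r
# Toy data: the same tag stored under two spellings (values are invented)
toy <- data.frame(
  father_s_day = c(1, 0, 0, 1),
  fathers_day  = c(0, 0, 1, 1)
)

# A recipe carries the tag if either spelling is flagged
toy$father_s_day <- as.integer(toy$father_s_day == 1 | toy$fathers_day == 1)

# Drop the redundant spelling once it has been folded in
toy$fathers_day <- NULL
toy$father_s_day
```

The same pattern would apply to the other pairs in the first bullet, assuming both spellings really exist as columns.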
We have now solved all these issues and manually classified every variable of our dataset into the categories shown below.
us_states <- c("alabama", "alaska", "arizona", "california", "colorado", "connecticut", "florida", "georgia", "hawaii", "idaho", "illinois", "indiana", "iowa", "kansas", "kentucky", "louisiana", "maine", "maryland", "massachusetts", "michigan", "minnesota", "mississippi", "missouri", "nebraska", "new_hampshire", "new_jersey", "new_mexico", "new_york", "north_carolina", "ohio", "oklahoma", "oregon", "pennsylvania", "rhode_island", "south_carolina", "tennessee", "texas", "utah", "vermont", "virginia", "washington", "west_virginia", "wisconsin")
us_cities <- c("aspen", "atlanta", "beverly_hills", "boston","brooklyn", "buffalo", "cambridge", "chicago", "columbus", "costa_mesa", "dallas", "denver", "healdsburg", "hollywood", "houston", "kansas_city", "lancaster", "las_vegas", "london", "los_angeles", "louisville", "miami", "minneapolis", "new_orleans", "pacific_palisades", "paris", "pasadena", "pittsburgh", "portland", "providence", "san_francisco", "santa_monica", "seattle", "st_louis", "washington_d_c", "westwood", "yonkers")
countries <- c("australia", "bulgaria", "canada", "chile", "cuba", "dominican_republic", "egypt", "england", "france", "germany", "guam", "haiti", "ireland", "israel", "italy", "jamaica", "japan", "mexico", "mezcal", "peru", "philippines", "spain", "switzerland")
alcohol <- c("alcoholic", "amaretto", "beer", "bitters", "bourbon", "brandy", "calvados", "campari", "chambord", "champagne", "chartreuse", "cocktail", "cognac_armagnac", "creme_de_cacao", "digestif", "eau_de_vie", "fortified_wine", "frangelico", "gin", "grand_marnier", "grappa", "kahlua", "kirsch", "liqueur", "long_beach", "margarita", "marsala", "martini", "midori", "pernod", "port", "punch", "red_wine", "rum", "sake", "sangria", "scotch", "sherry", "sparkling_wine", "spirit", "spritzer", "tequila", "triple_sec", "vermouth", "vodka", "whiskey", "white_wine", "wine")
others <- c("bon_appetit", "bon_app_tit", "condiment_spread", "cr_me_de_cacao", "epi_ushg", "flaming_hot_summer", "frankenrecipe", "harpercollins", "house_garden", "no_meat_no_problem", "parade", "sandwich_theory", "self", "shower", "tested_improved", "windsor", "weelicious", "snack_week", "tailgating", "quick_and_healthy", "picnic", "kitchen_olympics", "house_cocktail", "hors_d_oeuvre", "frozen_dessert", "freezer_food", "edible_gift", "cookbook_critic", "cook_like_a_diner", "condiment", "cocktail_party", "camping", "buffet", "x30_days_of_groceries", "x_cakeweek", "x_wasteless", "x22_minute_meals", "x3_ingredient_recipes")
chef <- c("anthony_bourdain", "dorie_greenspan", "emeril_lagasse", "nancy_silverton", "suzanne_goin")
interesting <- c("advance_prep_required", "entertaining", "epi_loves_the_microwave", "friendsgiving", "game", "gourmet", "healthy", "high_fiber", "hot_drink", "kid_friendly", "kidney_friendly", "microwave", "no_cook", "one_pot_meal", "oscars", "paleo", "pescatarian", "poker_game_night", "potluck", "quick_easy", "cookbooks", "leftovers")
seasons_vec <- c("winter", "spring", "summer", "fall")
celebrations <- c("anniversary", "back_to_school", "bastille_day", "birthday", "christmas", "christmas_eve", "cinco_de_mayo", "date", "diwali", "easter", "engagement_party", "family_reunion", "father_s_day", "fourth_of_july", "graduation", "halloween", "hanukkah", "kentucky_derby", "kwanzaa", "labor_day", "lunar_new_year", "mardi_gras", "mother_s_day", "new_year_s_day", "new_year_s_eve", "oktoberfest", "party", "passover", "persian_new_year", "purim", "ramadan", "rosh_hashanah_yom_kippur", "shavuot", "st_patrick_s_day", "sukkot", "super_bowl", "thanksgiving", "valentine_s_day", "wedding")
drink_no_alcohol_vec <- c("apple_juice", "fruit_juice", "iced_tea", "lemon_juice", "lime_juice", "orange_juice", "pomegranate_juice", "tea")
tools <- c("coffee_grinder", "double_boiler", "food_processor", "ice_cream_machine", "juicer", "mandoline", "mixer", "mortar_and_pestle", "pasta_maker", "ramekin", "skewer", "slow_cooker", "smoker", "wok", "blender", "candy_thermometer", "pressure_cooker")
cooking_techniques <- c("raw", "saute", "freeze_chill", "fry", "stir_fry", "simmer", "boil", "broil", "bake", "braise", "chill", "deep_fry", "steam", "rub", "roast", "poach", "pan_fry", "marinate", "grill_barbecue", "grill")
nutritional_values <- c("calories", "protein", "fat", "sodium")
recipe_type_vec <- c("aperitif", "appetizer", "breakfast", "brunch", "dessert", "dinner", "lunch", "side", "snack")
diet_preferences_vec <- c("dairy_free", "fat_free", "kosher","kosher_for_passover", "low_cal", "low_cholesterol", "low_carb", "low_fat", "low_sodium", "low_sugar", "low_no_sugar", "non_alcoholic", "no_sugar_added", "organic", "peanut_free", "soy_free", "sugar_conscious", "tree_nut_free", "vegan", "vegetarian", "wheat_gluten_free")
### Ingredients
#low level categories
vegetables_vec <- c("artichoke", "arugula", "asparagus", "butternut_squash", "bean", "beet", "bell_pepper", "bok_choy", "broccoli", "broccoli_rabe", "brussel_sprout", "cabbage", "capers", "carrot", "cauliflower", "celery", "chard", "chile_pepper", "collard_greens", "corn", "cucumber", "eggplant", "endive", "escarole", "fennel", "garlic", "ginger", "green_bean", "green_onion_scallion", "horseradish", "jerusalem_artichoke", "jicama", "kale", "leafy_green", "leek", "legume", "lentil", "lettuce", "lima_bean", "mushroom", "mustard_greens", "okra", "onion", "parsnip", "pea", "pickles", "poblano", "pumpkin", "radicchio", "radish", "root_vegetable", "rutabaga", "salad", "shallot", "soy", "spinach", "squash", "sugar_snap_pea", "tapioca", "tomatillo", "tomato", "turnip", "watercress", "yellow_squash", "yuca", "zucchini")
pork_meat_vec <- c("bacon", "ham", "pork", "pork_chop", "pork_rib", "pork_tenderloin", "prosciutto")
lamb_meat_vec <- c("ground_lamb", "lamb", "lamb_chop", "lamb_shank", "rack_of_lamb")
beef_meat_vec <- c("beef", "beef_rib", "beef_shank", "beef_tenderloin", "brisket", "ground_beef", "hamburger", "veal")
meat_with_wings_vec <- c("chicken", "duck", "goose", "poultry", "poultry_sausage", "quail", "turkey")
meat_various_vec <- c("meatball", "meatloaf", "rabbit", "sausage", "steak", "venison")
# stuff_in_the_water <- c("anchovy", "bass", "caviar", "clam", "cod", "crab", "fish", "halibut", "lobster", "mussel", "octopus", "oyster", "salmon", "sardine", "scallop", "seafood", "shellfish", "shrimp", "snapper", "squid", "swordfish", "tilapia", "trout", "tuna")
seafood_vec <- c("clam", "crab", "lobster", "mussel", "octopus", "oyster", "scallop", "shellfish", "shrimp", "squid")
fish_vec <- c("anchovy", "bass", "caviar", "cod", "halibut", "salmon", "sardine", "snapper", "swordfish", "tilapia", "trout", "tuna")
herbs_vec <- c("anise", "basil", "chive", "cilantro", "coriander", "dill", "lemongrass", "mint", "oregano", "parsley", "rosemary", "sage", "tarragon", "thyme")
nuts_vec <- c("almond", "cashew", "chestnut", "hazelnut", "macadamia_nut", "peanut", "pecan", "pine_nut", "pistachio", "tree_nut", "walnut")
cereals_vec <- c("barley", "bran", "bulgur", "grains", "granola", "oat", "quinoa", "rye", "whole_wheat")
carbs_vec <- c("brown_rice", "chickpea", "cornmeal", "couscous", "hominy_cornmeal_masa", "orzo", "pasta", "potato", "rice", "semolina", "sweet_potato_yam", "wild_rice")
fruits_vec <- c("apple", "apricot", "asian_pear", "avocado", "banana", "berry", "blackberry", "blueberry", "cantaloupe", "cherry", "citrus", "coconut", "cranberry", "currant", "dried_fruit", "fig", "grape", "grapefruit", "guava", "honeydew", "kiwi", "kumquat", "lemon", "lime", "lingonberry", "lychee", "mango", "melon", "nectarine", "olive", "orange", "papaya", "passion_fruit", "peach", "pear", "persimmon", "pineapple", "plantain", "plum", "pomegranate", "prune", "quince", "raisin", "raspberry", "rhubarb", "strawberry", "tamarind", "tangerine", "tropical_fruit", "watermelon")
dessert_vec <- c("biscuit", "brownie", "butterscotch_caramel", "cake", "candy", "chocolate", "cobbler_crumble", "cookie", "cookies", "cranberry_sauce", "crepe", "cupcake", "honey", "jam_or_jelly", "maple_syrup", "marshmallow", "muffin","pancake", "pastry", "pie", "smoothie", "sorbet", "souffle_meringue", "waffle")
cheeses_vec <- c("blue_cheese", "brie", "cheddar", "cottage_cheese", "cream_cheese", "feta", "fontina", "goat_cheese", "gouda", "monterey_jack", "mozzarella", "parmesan", "ricotta", "swiss_cheese")
dairy_vec <- c("butter", "buttermilk", "custard", "egg_nog", "ice_cream", "marscarpone", "milk_cream", "sour_cream", "yogurt")
spices_vec <- c("caraway", "cardamom", "chili", "cinnamon", "clove", "cumin", "curry", "hot_pepper", "jalapeno", "marinade", "nutmeg", "paprika", "pepper", "poppy", "saffron", "sesame", "sesame_oil", "soy_sauce", "vanilla", "wasabi")
#top level categories
# used to select the columns in ingredients_df; could also be used later on in a for loop
general_categories <- c("vegetable", "meat", "fish", "seafood", "herb", "nut", "fruit", "drink", "cheese", "dairy", "spice")
all_meats <- c(beef_meat_vec, pork_meat_vec, lamb_meat_vec, meat_with_wings_vec, meat_various_vec)
all_fish_seafood <- c(fish_vec, seafood_vec)
all_ingredients <- c(vegetables_vec, all_meats, all_fish_seafood, herbs_vec, nuts_vec, cereals_vec, carbs_vec, fruits_vec, drink_no_alcohol_vec, dessert_vec, cheeses_vec, dairy_vec, spices_vec, "egg")
#stuff which isn't ingredients and that we need to sort
meals <- c("backyard_bbq", "bread", "breadcrumbs", "brine", "burrito", "casserole_gratin", "coffee", "flat_bread", "hummus", "iced_coffee", "lasagna", "macaroni_and_cheese", "mayonnaise", "mustard", "noodle", "oatmeal", "omelet", "peanut_butter", "pizza", "pot_pie", "potato_salad", "quiche", "rose", "salad_dressing", "salsa", "sandwich", "sauce", "seed", "soup_stew", "stew", "stock", "stuffing_dressing", "taco", "tart", "tofu", "tortillas", "vinegar", "frittata", "molasses", "sourdough", "fritter", "phyllo_puff_pastry_dough", "dip")
# whole list of columns to set aside for now, so that the remaining ones can be sorted
# (uses the vector names defined above; `others` and `meals` are assumed to be the former `wtf` and `to_sort` vectors)
to_remove_temp <- c(us_states, us_cities, countries, alcohol, others, chef, interesting, seasons_vec, celebrations, tools, cooking_techniques, nutritional_values, recipe_type_vec, diet_preferences_vec, all_ingredients, meals)
# tried this with select() but it failed because some names in the vector are not columns of the dataset; any_of() would tolerate the missing names
recipes_to_filter <- recipes[, !(colnames(recipes) %in% to_remove_temp)]
#creates a tibble with one column with all the colnames to be able to sort ingredients
names <- recipes_to_filter %>% colnames() %>% as_tibble()
# recipes %>%
# select(iceland)
# filter(x_cakeweek==1)
# checked whether some values in our categories were not columns in recipes --> 19 of them were not, so they were deleted from the vectors
checking <- tibble(to_remove_temp[!to_remove_temp %in% colnames(recipes)])
checking
# checking the ingredients separately, in case one name in the ingredient vector is not a column name
checking_ing <- tibble(all_ingredients[!all_ingredients %in% colnames(recipes)])
checking_ing
# no name comes out, so the only remaining explanation is a duplicated ingredient name --> it was arugula
all_ingredients[duplicated(all_ingredients)]
2 Data Understanding
2.1 Structure and summary
# Now let's see the structure of our data
recipes %>%
head(20) %>%
str()
#> 'data.frame': 20 obs. of 681 variables:
#> $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ title : chr "Lentil, Apple, and Turkey Wrap "..
#> $ rating : num 2.5 4.38 3.75 5 3.12 ...
#> $ calories : num 426 403 165 NA 547 948 NA NA 170 ..
#> $ protein : num 30 18 6 NA 20 19 NA NA 7 23 ...
#> $ fat : num 7 23 7 NA 32 79 NA NA 10 41 ...
#> $ sodium : num 559 1439 165 NA 452 ...
#> $ x_cakeweek : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ x_wasteless : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ x22_minute_meals : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ x3_ingredient_recipes : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ x30_days_of_groceries : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ advance_prep_required : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ alabama : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ alaska : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ alcoholic : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ almond : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ amaretto : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ anchovy : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ anise : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ anniversary : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ anthony_bourdain : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ aperitif : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ appetizer : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ apple : num 1 0 0 0 0 0 0 0 0 0 ...
#> $ apple_juice : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ apricot : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ arizona : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ artichoke : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ arugula : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ asian_pear : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ asparagus : num 0 0 0 0 0 0 1 0 0 0 ...
#> $ aspen : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ atlanta : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ australia : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ avocado : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ back_to_school : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ backyard_bbq : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ bacon : num 0 0 0 0 0 1 0 0 0 0 ...
#> $ bake : num 0 1 0 0 1 0 0 0 0 0 ...
#> $ banana : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ barley : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ basil : num 0 0 0 0 0 1 0 0 0 0 ...
#> $ bass : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ bastille_day : num 0 1 0 0 0 0 0 0 0 0 ...
#> $ bean : num 1 0 0 0 0 0 0 0 0 0 ...
#> $ beef : num 0 0 0 0 0 0 0 0 1 0 ...
#> $ beef_rib : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ beef_shank : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ beef_tenderloin : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ beer : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ beet : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ bell_pepper : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ berry : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ beverly_hills : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ birthday : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ biscuit : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ bitters : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ blackberry : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ blender : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ blue_cheese : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ blueberry : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ boil : num 0 0 0 0 0 0 1 0 0 0 ...
#> $ bok_choy : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ bon_appetit : num 0 1 0 1 1 1 1 0 0 0 ...
#> $ bon_app_tit : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ boston : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ bourbon : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ braise : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ bran : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ brandy : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ bread : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ breadcrumbs : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ breakfast : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ brie : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ brine : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ brisket : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ broccoli : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ broccoli_rabe : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ broil : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ brooklyn : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ brown_rice : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ brownie : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ brunch : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ brussel_sprout : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ buffalo : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ buffet : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ bulgaria : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ bulgur : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ burrito : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ butter : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ buttermilk : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ butternut_squash : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ butterscotch_caramel : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ cabbage : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ cake : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ california : num 0 0 0 0 1 0 0 0 0 0 ...
#> $ calvados : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ cambridge : num 0 0 0 0 0 0 0 0 0 0 ...
#> [list output truncated]
# Apart from ID (integer) and title (character), all variables are stored as numeric, but in reality only 5 of them could be considered numerical: "rating", "calories", "protein", "fat" and "sodium". The other variables should be considered categorical since they only take the values 0 or 1.
# Let's have a different look at the data with the summary function.
recipes %>%
select(rating, calories, protein, fat, sodium) %>%
dfSummary(style = "grid")
#> Data Frame Summary
#> recipes
#> Dimensions: 20052 x 5
#> Duplicates: 5576
#>
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
#> | No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
#> +====+===========+===========================+======================+===========+==========+=========+
#> | 1 | rating | Mean (sd) : 3.7 (1.3) | 0.00 : 1836 ( 9.2%) | I | 20052 | 0 |
#> | | [numeric] | min < med < max: | 1.25 : 164 ( 0.8%) | | (100.0%) | (0.0%) |
#> | | | 0 < 4.4 < 5 | 1.88!: 124 ( 0.6%) | | | |
#> | | | IQR (CV) : 0.6 (0.4) | 2.50 : 532 ( 2.7%) | | | |
#> | | | | 3.12!: 1489 ( 7.4%) | I | | |
#> | | | | 3.75 : 5169 (25.8%) | IIIII | | |
#> | | | | 4.38!: 8019 (40.0%) | IIIIIII | | |
#> | | | | 5.00 : 2719 (13.6%) | II | | |
#> | | | | ! rounded | | | |
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
#> | 2 | calories | Mean (sd) : 6323 (359046) | 1858 distinct values | : | 15935 | 4117 |
#> | | [numeric] | min < med < max: | | : | (79.5%) | (20.5%) |
#> | | | 0 < 331 < 30111218 | | : | | |
#> | | | IQR (CV) : 388 (56.8) | | : | | |
#> | | | | | : | | |
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
#> | 3 | protein | Mean (sd) : 100.2 (3840) | 282 distinct values | : | 15890 | 4162 |
#> | | [numeric] | min < med < max: | | : | (79.2%) | (20.8%) |
#> | | | 0 < 8 < 236489 | | : | | |
#> | | | IQR (CV) : 24 (38.3) | | : | | |
#> | | | | | : | | |
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
#> | 4 | fat | Mean (sd) : 346.9 (20456) | 326 distinct values | : | 15869 | 4183 |
#> | | [numeric] | min < med < max: | | : | (79.1%) | (20.9%) |
#> | | | 0 < 17 < 1722763 | | : | | |
#> | | | IQR (CV) : 26 (59) | | : | | |
#> | | | | | : | | |
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
#> | 5 | sodium | Mean (sd) : 6226 (333318) | 2434 distinct values | : | 15933 | 4119 |
#> | | [numeric] | min < med < max: | | : | (79.5%) | (20.5%) |
#> | | | 0 < 294 < 27675110 | | : | | |
#> | | | IQR (CV) : 631 (53.5) | | : | | |
#> | | | | | : | | |
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
# We can already see, for instance, that the most frequent value of the variable "rating" is 4.38 (40% of the total). Moreover, we observe that the variables "calories", "protein", "fat" and "sodium" each have roughly 20% of missing values.
3 Data Cleaning
3.1 Analysis of NAs
#plot of missing values for each variable
recipes %>%
select(rating, all_of(nutritional_values)) %>%
gg_miss_var()+
labs(title = "Number of NA Values for Rating and the Nutritional Variables")
# temp_df is used whenever we need a temporary data frame for a single analysis that will not be reused later on
temp_df <- recipes %>%
select(title, all_of(nutritional_values))
na_obs <- which(rowSums(is.na(temp_df)) > 0)
# subset the original dataframe to only include rows with NA values
df_na <- temp_df[na_obs, ]
# print the result
#df_na
# count the number of NAs for each row
na_count <- rowSums(is.na(df_na))
# count the frequency of NA counts
freq_table_na <- table(na_count)
freq_na <- as.data.frame(freq_table_na) %>%
mutate(na_count = as.character(na_count))
freq_na %>%
ggplot(aes(x=na_count, y=Freq)) +
geom_bar(stat="identity") +
xlab("Number of NAs") +
ylab("Frequency") +
labs(title ="Number of NAs in nutritional values per recipe", subtitle = "NAs among variables calories, protein, fat, sodium") +
coord_flip()
recipes <- recipes %>%
drop_na()
Among the 4188 recipes with at least one NA, the large majority (4117) are missing all 4 nutritional values. Without any other information available, imputing these values would not make sense. Interestingly, no recipe has exactly 3 missing values.
We could try to impute the 29 recipes that have only 1 NA, whereas the same operation on the 42 recipes with 2 NAs would not deliver accurate results (4117 + 42 + 29 = 4188). However, we believe that imputation is not worthwhile here. The nutritional values per recipe are themselves estimates, so imputing them would amount to estimating an estimate, and it is unclear how reliable that would be. Since nutritional values are crucial information for our analysis, we decided to eliminate the recipes with NAs in them.
This leaves 15864 recipes without NAs.
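The per-recipe NA count and the removal of incomplete rows can be sketched in base R on a small invented data frame. Here `complete.cases()` stands in for the `drop_na()` call used above, and all values are made up:

```r
# Toy nutritional data with missing values (numbers are invented)
toy <- data.frame(
  calories = c(426, NA, 165, NA),
  protein  = c(30, NA, 6, 12),
  fat      = c(7, NA, NA, 5),
  sodium   = c(559, NA, 165, 80)
)

# Number of missing nutritional values per recipe
na_per_row <- rowSums(is.na(toy))

# Frequency of each count, restricted to recipes with at least one NA
na_freq <- table(na_per_row[na_per_row > 0])

# Keep only complete rows (base R counterpart of tidyr::drop_na())
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```

In this toy case two recipes have 1 NA, one has 4 NAs, and a single complete recipe remains.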
3.2 Eliminate recipes with rating equal to zero
rating_count <- table(recipes$rating) %>%
as.data.frame() %>%
rename(rating = Var1,
frequency = Freq)
recipes <- recipes %>%
filter(rating != 0)
There are 1296 recipes with a rating equal to zero. Some of them might be unpopular, others might be too recent to have been rated. For the purpose of our analysis, we decided to eliminate these recipes.
We are left with 14568 observations after removing NAs and observations with a rating of 0.
3.3 Discard copies of recipes
We want to eliminate recipes that have multiple copies. Sometimes the recipes have the same title, but nutritional values are different. This indicates that there are various ways to prepare a specific recipe. We want to keep those recipes that have the same title, but have different nutritional values.
Let’s check for instance Almond Butter Crisps, a recipe which can be found twice in the data set, with ID=1026 and ID=8908.
# recipes %>%
# filter(ID == 1026)
unique_recipes <- distinct(recipes, title, rating, protein, sodium, fat, calories, .keep_all = TRUE)
# unique_recipes %>%
# filter(ID == 8908)
recipes <- unique_recipes
Now the data set is free from redundant copies. We discarded 1288 copies in total. Fewer copies would be discarded if the recipes without specific ingredients were removed first, which means that some of the copies also lack specific ingredients.
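The rule of keeping recipes with the same title but different nutritional values can be illustrated in base R with `duplicated()` on a subset of columns. This is only a sketch on invented data, with a reduced set of key columns compared with the `distinct()` call above:

```r
# Toy data: rows 1 and 2 are exact copies, row 3 shares the title only
toy <- data.frame(
  ID       = 1:3,
  title    = c("Crisps", "Crisps", "Crisps"),
  rating   = c(4.375, 4.375, 4.375),
  calories = c(100, 100, 150),
  stringsAsFactors = FALSE
)

# Columns that define a copy; ID is left out on purpose, since it is always unique
key_cols <- c("title", "rating", "calories")

# duplicated() flags every repeat after the first occurrence of a key
is_copy <- duplicated(toy[, key_cols])
toy_unique <- toy[!is_copy, ]
toy_unique$ID
```

The first occurrence is kept, so the copy with the lowest ID survives, while the variant with different calories is preserved.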
3.4 Removing recipes without specific ingredients listed
Here we face a challenge regarding the general ingredient categories. Computations on the binary columns themselves are not an issue: whether a recipe lists specific ingredients of a category or only has a 1 in the general category column, that information is captured in the corresponding binary column.
However, if we want to compute the total number of ingredients of each category present in a recipe, we run into problems. To illustrate, assume a recipe contains 3 vegetables (specific columns in vegetables_vec) and that its general column is also set to 1. Summing up then gives 4 ingredients when it should be 3.
Another problem concerns recipes for which the only column of a category (e.g., vegetables) set to 1 is the general one (i.e., vegetable), with no specific ingredient listed in vegetables_vec. This can lead to issues when counting the number of specific ingredients per category.
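Both problems can be reproduced on a toy data frame. The sketch below uses base R and invented values; `specific_cols` is a hypothetical stand-in for vegetables_vec:

```r
# Toy data: one general flag plus three specific vegetable columns (invented)
toy <- data.frame(
  vegetable = c(1, 1, 0),   # general category column
  carrot    = c(1, 0, 1),
  leek      = c(1, 0, 0),
  onion     = c(1, 0, 1)
)
specific_cols <- c("carrot", "leek", "onion")

# Naive count: the general flag is summed along with the specific ones
naive_count <- rowSums(toy[, c("vegetable", specific_cols)])

# Intended count: specific ingredients only
specific_count <- rowSums(toy[, specific_cols])

# Recipes flagged as containing vegetables without naming any
general_only <- toy$vegetable == 1 & specific_count == 0
```

The first recipe is counted as 4 instead of 3, and the second one is a "general only" case with a count of 0 specific vegetables.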
In order to decide whether to analyse with or without the general categories, let’s see how many observations remain if we remove all those for which a general category is set to 1 while all specific ingredients in that category are set to 0.
#this filters out the observations which have 1 for general category and 0s for every ingredient in that category
recipes <- recipes %>%
filter(!(if_all(all_of(vegetables_vec), ~.x == 0) & vegetable == 1)) %>%
filter(!(if_all(all_of(all_meats), ~.x == 0) & meat == 1)) %>%
filter(!(if_all(all_of(fish_vec), ~.x == 0) & fish == 1)) %>%
filter(!(if_all(all_of(seafood_vec), ~.x == 0) & seafood == 1)) %>%
filter(!(if_all(all_of(herbs_vec), ~.x == 0) & herb == 1)) %>%
filter(!(if_all(all_of(nuts_vec), ~.x == 0) & nut == 1)) %>%
filter(!(if_all(all_of(fruits_vec), ~.x == 0) & fruit == 1)) %>%
filter(!(if_all(all_of(drink_no_alcohol_vec), ~.x == 0) & drink == 1)) %>%
filter(!(if_all(all_of(cheeses_vec), ~.x == 0) & cheese == 1)) %>%
filter(!(if_all(all_of(dairy_vec), ~.x == 0) & dairy == 1)) %>%
filter(!(if_all(all_of(spices_vec), ~.x == 0) & spice == 1))
We are left with 10321 observations after removing all recipes for which, in at least one category, the general variable is set to 1 while no specific ingredient is listed.
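The eleven filters above all follow the same pattern, so they could also be written as a loop over a mapping from each general column to its specific columns. The following base R sketch runs on invented data; `category_map` is a hypothetical helper, not an object from the analysis:

```r
# Toy data with two categories (all values invented)
toy <- data.frame(
  vegetable = c(1, 1, 0, 0),
  carrot    = c(1, 0, 0, 0),
  leek      = c(0, 0, 0, 0),
  fruit     = c(0, 0, 1, 1),
  apple     = c(0, 0, 0, 1)
)

# Hypothetical mapping: general column name -> its specific ingredient columns
category_map <- list(
  vegetable = c("carrot", "leek"),
  fruit     = c("apple")
)

keep <- rep(TRUE, nrow(toy))
for (general in names(category_map)) {
  specific <- category_map[[general]]
  # drop = FALSE keeps a data frame even for a single column
  none_listed <- rowSums(toy[, specific, drop = FALSE]) == 0
  # Discard rows where only the general flag is set
  keep <- keep & !(toy[[general]] == 1 & none_listed)
}
toy_filtered <- toy[keep, ]
```

Rows 2 and 3 are removed because their general flag is set without any specific ingredient, which mirrors the behaviour of the chained `filter()` calls.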
4 EDA
4.1 Nutrition EDA
4.1.1 Visual exploration - Univariate Analysis
4.1.1.1 Rating Barplot
recipes %>%
ggplot(aes(x=as.factor(rating), fill=as.factor(rating) )) +
geom_bar( ) +
scale_fill_manual(values = c("red4", "red3", "orangered", "orange", "gold", "greenyellow", "green3", "green4") ) +
theme(legend.position="none") +
scale_y_continuous(breaks=seq(0,10000,1000)) +
labs(x = "Rating", y = "Number of recipes",
title = "Overview of recipes' ratings")
The ratings available fall into 7 distinct values, which span from 1.25 to 5 (recall that we excluded recipes with a rating equal to zero). As we can see, most of the ratings are equal to or above 3.75, and the single most common rating is 4.375.
4.1.1.2 Calories - Boxplot and Histogram
recipes_plot <- recipes %>%
pivot_longer(cols = c(calories, protein, fat, sodium),
names_to = "nutrition",
values_to = "n_value")
# Calories boxplot not filtered
recipes_plot %>%
filter(nutrition == "calories") %>%
ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_light() +
theme(legend.position="none",
plot.title = element_text(size=11)) +
ggtitle("Boxplot of calories nutritional value") +
xlab("") +
ylab("Value")
recipes_plot %>%
filter(nutrition == "calories") %>%
select(title, nutrition, n_value) %>%
arrange(desc(n_value))
#> # A tibble: 10,321 x 3
#> title nutrition n_value
#> <chr> <chr> <dbl>
#> 1 "Pear-Cranberry Mincemeat Lattice Pie " calories 3.01e7
#> 2 "Deep-Dish Wild Blueberry Pie " calories 3.00e7
#> 3 "Apricot, Cranberry and Walnut Pie " calories 1.31e7
#> 4 "Lamb Köfte with Tarator Sauce " calories 4.52e6
#> 5 "Rice Pilaf with Lamb, Carrots, and Raisins " calories 4.16e6
#> 6 "Chocolate-Almond Pie " calories 3.36e6
#> 7 "Caramelized Apple and Pear Pie " calories 3.36e6
#> 8 "Merguez Lamb Patties with Golden Raisin Cousco~ calories 5.45e4
#> 9 "Grilled Lamb Chops with Porcini Mustard " calories 2.41e4
#> 10 "Braised Short Ribs with Red Wine Gravy " calories 1.96e4
#> # i 10,311 more rows
We notice that there are recipes with more than 30’000’000 calories, which are clearly outliers. With such values the boxplot is hard to interpret, and even a density plot does not yield any insight. We must therefore discard them in order to continue with a meaningful analysis. There are 28 recipes with more than 7000 calories; we consider these extreme values, and they represent 0.27% of the data available. Why do they exist? It could be due to a miscalculation or to an excessive number of servings per recipe. Based on the usual number of calories per recipe, we decided to keep only the recipes with a reasonable quantity, i.e., at most 7000 calories.
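As a quick check of the threshold logic, the count and share of values above a cut-off can be computed as follows. The calorie vector is invented; only the figures 28, 10321 and 7000 come from the text:

```r
# Toy calorie values with one implausible entry (numbers are invented)
calories <- c(120, 331, 450, 980, 5000000)
threshold <- 7000   # cut-off used in the text

# Count and share of values above the cut-off in the toy vector
n_extreme <- sum(calories > threshold)
share_toy <- 100 * n_extreme / length(calories)

# Share reported in the text: 28 extreme recipes out of 10321
share_reported <- round(100 * 28 / 10321, 2)
share_reported
```

The last line reproduces the 0.27% quoted above.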
# Calories boxplot
df <- recipes_plot %>%
filter(nutrition == "calories", n_value <= 7000)
boxplot_calories <- df %>%
ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
geom_boxplot()+
scale_fill_viridis(discrete = TRUE, alpha=0.6) +
theme(legend.position="none",
plot.title = element_text(size=11)) +
scale_y_continuous(breaks=seq(0,7000,500)) +
labs(x = "", y = "Calories", title = "Boxplot of calories nutritional value filtered")
# Calories histogram
histogram_calories <- df %>%
ggplot(aes(x=n_value)) +
geom_histogram(binwidth=50, fill="red3", color="red3", alpha=0.9) +
theme(plot.title = element_text(size=11)) +
scale_x_continuous(breaks=seq(0,10000,1000)) +
scale_y_continuous(breaks=seq(0,1750,250)) +
labs(x = "Calories", y = "Count", title = "Distribution of calories across all recipes")
# Calories density plot
density_calories <- df %>%
ggplot(aes(x=n_value)) +
geom_density(fill="red3", color="red2", alpha=0.8) +
theme(plot.title = element_text(size=11)) +
scale_x_continuous(breaks=seq(0,10000,1000)) +
ggtitle("Distribution of calories across all recipes") +
xlab("Calories")
grid.arrange(boxplot_calories, histogram_calories, density_calories, ncol=2, nrow =2)
After filtering extreme values we can observe that most of the recipes have between 200 and 600 calories. The histogram confirms that most recipes have fewer than 1000 calories.
4.1.1.3 Protein - Boxplot and Histogram
recipes_plot %>%
filter(nutrition == "protein") %>%
ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_light() +
theme(legend.position="none",
plot.title = element_text(size=11)) +
ggtitle("Boxplot of protein nutritional value") +
xlab("") +
ylab("Value")
We notice that there are recipes with more than 50’000 grams of protein, which are clearly outliers. We must discard those values to continue with a meaningful analysis; otherwise we could not extract any relevant information from the plots. By checking on the Epicurious website the recipes with protein values above 1000, we also verified that the amount of protein was not justified. We came to that conclusion by evaluating the average protein content per 100 grams of each ingredient in the specific recipe.
recipes_plot %>%
filter(nutrition == "protein") %>%
select(title, nutrition, n_value) %>%
arrange(desc(n_value))
#> # A tibble: 10,321 x 3
#> title nutrition n_value
#> <chr> <chr> <dbl>
#> 1 "Rice Pilaf with Lamb, Carrots, and Raisins " protein 236489
#> 2 "Pear-Cranberry Mincemeat Lattice Pie " protein 200968
#> 3 "Deep-Dish Wild Blueberry Pie " protein 200210
#> 4 "Lamb Köfte with Tarator Sauce " protein 166471
#> 5 "Apricot, Cranberry and Walnut Pie " protein 87188
#> 6 "Chocolate-Almond Pie " protein 58334
#> 7 "Caramelized Apple and Pear Pie " protein 58324
#> 8 "Merguez Lamb Patties with Golden Raisin Cousco~ protein 2074
#> 9 "Manhattan Clam Chowder " protein 1625
#> 10 "Clam and Oyster Chowder " protein 1365
#> # i 10,311 more rows
# Proteins boxplot
boxplot_protein <- recipes_plot %>%
filter(nutrition == "protein", n_value <= 1000) %>%
ggplot( aes(x=nutrition, y=n_value, fill=nutrition)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme(legend.position="none",
plot.title = element_text(size=11)) +
scale_y_continuous(breaks=seq(0,7000,25)) +
ggtitle("Boxplot of protein nutritional value filtered") +
xlab("") +
ylab("Proteins")
# Proteins histogram
histogram_protein <- recipes_plot %>%
filter(nutrition == "protein", n_value <= 1000) %>%
ggplot(aes(x=n_value)) +
geom_histogram(binwidth=7, fill="red3", color="red3", alpha=0.9) +
theme(plot.title = element_text(size=15)) +
scale_x_continuous(breaks=seq(0,1000,50)) +
scale_y_continuous(breaks=seq(0,7000,250)) +
ggtitle("Distribution of proteins across all recipes") +
xlab("Proteins") +
ylab("Count")
# Protein density plot
density_protein <- recipes_plot %>%
filter(nutrition == "protein", n_value <= 1000) %>%
ggplot(aes(x=n_value)) +
geom_density(fill="red3", color="red2", alpha=0.8) +
scale_x_continuous(breaks=seq(0,1000,50)) +
ggtitle("Distribution of proteins across all recipes") +
xlab("Proteins")
grid.arrange(boxplot_protein, histogram_protein, density_protein, ncol=2, nrow =3)
From the boxplot, we observe that most recipes have less than 30 grams of protein, and the histogram confirms this. We could even extend the range to 100 grams of protein per recipe. We assume that recipes with values above this threshold contain ingredients like meat, tuna, salmon or shrimp.
4.1.1.4 Sodium - Boxplot and Histogram
recipes_plot %>%
filter(nutrition == "sodium") %>%
ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme(legend.position="none",
plot.title = element_text(size=11)) +
ggtitle("Boxplot of sodium nutritional value") +
xlab("") +
ylab("Value")
We notice that there are recipes with more than 100’000 milligrams of sodium, which are clearly outliers. We must discard those values to continue with a meaningful analysis. After further research, we consider sodium values above 30’000 milligrams highly suspicious.
recipes_plot %>%
filter(nutrition == "sodium") %>%
select(title, nutrition, n_value) %>%
arrange(desc(n_value))
#> # A tibble: 10,321 x 3
#> title nutrition n_value
#> <chr> <chr> <dbl>
#> 1 "Pear-Cranberry Mincemeat Lattice Pie " sodium 27675110
#> 2 "Deep-Dish Wild Blueberry Pie " sodium 27570999
#> 3 "Apricot, Cranberry and Walnut Pie " sodium 12005810
#> 4 "Lamb Köfte with Tarator Sauce " sodium 7540990
#> 5 "Chocolate-Almond Pie " sodium 3449512
#> 6 "Caramelized Apple and Pear Pie " sodium 3449373
#> 7 "Rice Pilaf with Lamb, Carrots, and Raisins " sodium 3134853
#> 8 "Whole Branzino Roasted in Salt " sodium 132220
#> 9 "Red Snapper Baked in Salt with Romesco Sauce " sodium 132025
#> 10 "Scallops with Mushrooms in White-Wine Sauce " sodium 90572
#> # i 10,311 more rows
# Sodium boxplot
boxplot_sodium <- recipes_plot %>%
filter(nutrition == "sodium", n_value <= 30000) %>%
ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme(legend.position="none",
plot.title = element_text(size=11)) +
scale_y_continuous(breaks=seq(0,30000,500)) +
ggtitle("Boxplot of sodium nutritional value") +
xlab("") +
ylab("Sodium")
# Sodium histogram
histogram_sodium <- recipes_plot %>%
filter(nutrition == "sodium", n_value <= 30000) %>%
ggplot(aes(x=n_value)) +
geom_histogram(binwidth=50, fill="red3", color="red3", alpha=0.9) +
theme(plot.title = element_text(size=15)) +
scale_x_continuous(breaks=seq(0,30000,1000)) +
scale_y_continuous(breaks=seq(0,1750,250)) +
ggtitle("Distribution of sodium across all recipes") +
xlab("Sodium") +
ylab("Count")
# Sodium density plot
density_sodium <- recipes_plot %>%
filter(nutrition == "sodium", n_value <= 30000) %>%
ggplot(aes(x=n_value)) +
geom_density(fill="red3", color="red2", alpha=0.8) +
scale_x_continuous(breaks=seq(0,30000,1000)) +
ggtitle("Distribution of sodium across all recipes") +
xlab("Sodium")
grid.arrange(boxplot_sodium, histogram_sodium, density_sodium, ncol=1, nrow =3)
From the boxplot we observe that most recipes have sodium values below 750 milligrams. The histogram confirms this, although a substantial number of recipes have between 750 and 2000 milligrams of sodium.
4.1.1.5 Fat - Boxplot and Histogram
recipes_plot %>%
filter(nutrition == "fat") %>%
ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_light() +
theme(legend.position="none",
plot.title = element_text(size=11)) +
ggtitle("Boxplot of fat nutritional value") +
xlab("") +
ylab("Value")
We notice that there are recipes with more than 44’000 grams of fat, which are clearly outliers. We must discard those values to continue with a meaningful analysis. By checking on the Epicurious website the recipes with fat values above 1000, we also verified that the amount of fat was not justified. We came to that conclusion by evaluating the average fat content per 100 grams of each ingredient in the specific recipe.
recipes_plot %>%
filter(nutrition == "fat") %>%
select(title, nutrition, n_value) %>%
arrange(desc(n_value))
#> # A tibble: 10,321 x 3
#> title nutrition n_value
#> <chr> <chr> <dbl>
#> 1 "Pear-Cranberry Mincemeat Lattice Pie " fat 1722763
#> 2 "Deep-Dish Wild Blueberry Pie " fat 1716279
#> 3 "Apricot, Cranberry and Walnut Pie " fat 747374
#> 4 "Rice Pilaf with Lamb, Carrots, and Raisins " fat 221495
#> 5 "Chocolate-Almond Pie " fat 186660
#> 6 "Caramelized Apple and Pear Pie " fat 186642
#> 7 "Lamb Köfte with Tarator Sauce " fat 44198
#> 8 "Grilled Lamb Chops with Porcini Mustard " fat 2228
#> 9 "Braised Short Ribs with Red Wine Gravy " fat 1818
#> 10 "Braised Duck Legs with Shallots and Parsnips " fat 1610
#> # i 10,311 more rows
# Fat boxplot
boxplot_fat <- recipes_plot %>%
filter(nutrition == "fat", n_value <= 40000) %>%
ggplot( aes(x=nutrition, y=n_value, fill=nutrition)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_light() +
theme(legend.position="none",
plot.title = element_text(size=11)) +
scale_y_continuous(breaks=seq(0,3000,100)) +
ggtitle("Boxplot of fat nutritional value filtered") +
xlab("") +
ylab("Fat")
# Fat histogram
histogram_fat <- recipes_plot %>%
filter(nutrition == "fat", n_value <= 40000) %>%
ggplot(aes(x=n_value)) +
geom_histogram(binwidth=7, fill="red3", color="red3", alpha=0.9) +
theme(plot.title = element_text(size=15)) +
ggtitle("Distribution of fat across all recipes") +
scale_x_continuous(breaks=seq(0,3000,100)) +
scale_y_continuous(breaks=seq(0,7000,250)) +
xlab("Fat") +
ylab("Count")
# Fat density plot
density_fat<- recipes_plot %>%
filter(nutrition == "fat", n_value <= 40000) %>%
ggplot(aes(x=n_value)) +
geom_density(fill="red3", color="red2", alpha=0.8) +
scale_x_continuous(breaks=seq(0,3000,100)) +
ggtitle("Distribution of fat across all recipes") +
xlab("Fat")
grid.arrange(boxplot_fat, histogram_fat, density_fat, ncol=3)
It is hard to interpret the boxplot. There are certain recipes which
could have potentially more than 1000 or even 2000 grams of fat because
of the high quantity of servings and the use of ingredients such as
lamb, duck and bacon. We must then analyse the histogram to have a
better overview and we notice that most recipes have fat values below
100 grams.
4.1.2 Visual exploration - Multivariate Analysis
4.1.2.1 Scatterplots of Rating-Calories
#removing 41 outliers we discovered above from the recipes df
recipes <- recipes %>%
filter(calories <= 7000, protein <= 1000, sodium <= 30000, fat <= 40000)
# Scatterplot of Rating-Calories
sp1 <- recipes %>%
ggplot(aes(x=calories, y=rating)) +
geom_point(alpha=.5) +
ggtitle("Scatterplot of rating against calories") +
xlab("Calories") +
ylab("Rating")
# Scatterplot of Rating-Protein
sp2 <- recipes %>%
ggplot(aes(x=protein, y=rating)) +
geom_point(alpha=.5) +
ggtitle("Scatterplot of rating against proteins") +
xlab("Proteins") +
ylab("Rating")
# Scatterplot of Rating-Fat
sp3 <- recipes %>%
ggplot(aes(x=fat, y=rating)) +
geom_point(alpha=.5) +
ggtitle("Scatterplot of rating against fat") +
xlab("Fat") +
ylab("Rating")
# Scatterplot of Rating-Sodium
sp4 <- recipes %>%
ggplot(aes(x=sodium, y=rating)) +
geom_point(alpha=.5) +
ggtitle("Scatterplot of rating against sodium") +
xlab("Sodium") +
ylab("Rating")
grid.arrange(sp1, sp2, sp3, sp4, ncol=2, nrow =2)
We can observe that the recipes with more than 2000 calories tend to
have a higher rating. For instance, few recipes with less than a 3 star
rating have more than 2000 calories.
We can observe that the recipes with more than 125 grams of proteins tend to have a higher rating. For instance, few recipes with less than a 3 star rating have more than 125 grams of proteins.
We can observe that the recipes with more than 100 grams of fat tend to have a higher rating. For instance, few recipes with less than a 3 star rating have more than 100 grams of fat.
We can observe that the recipes with more than 5000 milligrams of sodium tend to have a higher rating. For instance, few recipes with less than a 3 star rating have more than 5000 mg of sodium.
4.1.2.2 Correlogram
corr_nutritional_values = recipes %>%
select(rating, calories, protein, fat, sodium) %>%
cor()
corrplot(corr_nutritional_values)
The previous scatterplots suggested a possible correlation between
rating and the nutritional values. The correlogram refutes this
hypothesis: the correlation with rating is almost zero for all the
nutritional values. On the other hand, we notice a strong positive
correlation between calories and fat as well as between calories and
proteins.
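One plausible reason for the strong link between calories, fat and protein is that calories are largely determined by the macronutrients. The sketch below illustrates this on made-up values, assuming the common approximation of 9 kcal per gram of fat and 4 kcal per gram of protein (carbohydrates are ignored here). It does not reproduce the correlations of the real data, and `rating` is deliberately left out.

```r
# Toy nutritional values (illustrative only, not taken from the dataset)
toy <- data.frame(
  fat     = c(5, 10, 20, 40, 80),
  protein = c(10, 8, 30, 20, 50)
)

# Assumption: calories approximated from fat and protein only
toy$calories <- 9 * toy$fat + 4 * toy$protein

# Pearson correlation matrix, the same kind of matrix passed to corrplot()
corr_toy <- cor(toy)
round(corr_toy, 2)
```

Because calories are built from fat and protein in this toy example, their correlations with calories are high by construction.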
4.1.2.3 Grouped Scatter
We now plot together the pairs of variables that show the strongest correlations.
# Grouped scatter of calories and fat
recipes_plot1 <- recipes %>%
filter(fat <= 400, calories <= 6000) %>%
ggplot(aes(x=calories, y=fat, color=rating)) +
geom_point(alpha=.5) +
scale_color_gradientn(colours = rainbow(5))
# Grouped scatter of calories and protein
recipes_plot2 <- recipes %>%
filter(protein <= 500, calories <= 6000) %>%
ggplot(aes(x=calories, y=protein, color=rating)) +
geom_point(alpha=.5) +
scale_color_gradientn(colours = rainbow(5))
# Grouped scatter of protein and fat
recipes_plot3 <- recipes %>%
filter(fat <= 400, protein <= 350) %>%
ggplot(aes(x=protein, y=fat, color=rating)) +
geom_point(alpha=.5) +
scale_color_gradientn(colours = rainbow(5))
# Grouped scatter of protein and sodium
recipes_plot4 <- recipes %>%
filter(sodium <= 400, protein <= 350) %>%
ggplot(aes(x=protein, y=sodium, color=rating)) +
geom_point(alpha=.5) +
scale_color_gradientn(colours = rainbow(5))
grid.arrange(recipes_plot1, recipes_plot2, recipes_plot3, recipes_plot4, ncol=2, nrow =2)
We notice a positive correlation between fat and calories as well as
between protein and calories. We also see a slightly positive
correlation between fat and protein. We tried to understand to what
extent the rating could have an impact on such relationships, but the
number of ratings above 3 is overwhelming and dominates the
behavior of these relationships.
4.2 Ingredients EDA
4.2.1 Feature engineering
We discovered that the variable “drinks” on its own had only 11 observations, 4 of which also had the variable “drink” equal to 1. Therefore, we decided to merge the two columns into a single category called “drink” for all drinks.
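Before merging two indicator columns it can help to cross-tabulate them to see how much they overlap. The sketch below uses made-up values and a separate `drink_merged` column (a hypothetical name) so that the original columns stay visible for comparison.

```r
# Toy indicator columns (illustrative only, not taken from the dataset)
toy <- data.frame(
  drink  = c(1, 0, 0, 1, 0),
  drinks = c(0, 1, 0, 1, 0)
)

# Cross-tabulate the two original columns to see how much they overlap
overlap <- table(drink = toy$drink, drinks = toy$drinks)
overlap

# A recipe is flagged as a drink if either indicator equals 1
toy$drink_merged <- as.integer(toy$drink == 1 | toy$drinks == 1)
toy
```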
#Creating a new dataframe with only the ID, title and the ingredients
ingredients_df <- recipes %>%
mutate(drink = ifelse(drink == 1 | drinks == 1, 1, 0)) %>% #merging drinks and drink
select(ID, title, all_of(all_ingredients), rating)
4.2.1.1 Creating binary ingredients categories
ingredients_df_bin <- ingredients_df %>%
mutate(vegetables_bin = as.numeric(if_any(all_of(vegetables_vec), ~.x == 1, na.rm = TRUE)),
meats_bin = as.numeric(if_any(all_of(all_meats), ~.x == 1, na.rm = TRUE)),
fish_bin = as.numeric(if_any(all_of(fish_vec), ~.x == 1, na.rm = TRUE)),
seafood_bin = as.numeric(if_any(all_of(seafood_vec), ~.x == 1, na.rm = TRUE)),
herbs_bin = as.numeric(if_any(all_of(herbs_vec), ~.x == 1, na.rm = TRUE)),
nuts_bin = as.numeric(if_any(all_of(nuts_vec), ~.x == 1, na.rm = TRUE)),
fruits_bin = as.numeric(if_any(all_of(fruits_vec), ~.x == 1, na.rm = TRUE)),
cheese_bin = as.numeric(if_any(all_of(cheeses_vec), ~.x == 1, na.rm = TRUE)),
dairy_bin = as.numeric(if_any(all_of(dairy_vec), ~.x == 1, na.rm = TRUE)),
spices_bin = as.numeric(if_any(all_of(spices_vec), ~.x == 1, na.rm = TRUE)),
cereals_bin = as.numeric(if_any(all_of(cereals_vec), ~.x == 1, na.rm = TRUE)),
carbs_bin = as.numeric(if_any(all_of(carbs_vec), ~.x == 1, na.rm = TRUE)),
dessert_bin = as.numeric(if_any(all_of(dessert_vec), ~.x == 1, na.rm = TRUE)),
egg_bin = (egg)
) %>%
select(ID, title, contains("bin"), everything())
The test below shows that both versions select the same number of rows (6586), with or without the general category column (e.g., vegetable). The general categories are therefore redundant in the dataset: they are not needed to create the binary columns, nor to compute the total number of ingredients per category in each recipe, so we do not include them in the first place.
####testing if I still need to include the general category to create the binary column now that I modified the df to only include recipes with ingredients specified
#
# #6586
# ingredients_df %>%
# mutate(vegetables_bin = as.factor(as.numeric(if_any(c(vegetable, all_of(vegetables_vec)), ~.x == 1, na.rm = TRUE)))) %>%
# filter(vegetables_bin == 1)
#
# #6586
# ingredients_df %>%
# mutate(vegetables_bin = as.factor(as.numeric(if_any(all_of(vegetables_vec), ~.x == 1, na.rm = TRUE)))) %>%
# filter(vegetables_bin == 1)
4.2.1.2 Creating total ingredients categories
ingredients_df_total <- ingredients_df %>%
mutate(total_ingredients = rowSums(select(., c(all_of(all_ingredients)))),
total_vegetables = rowSums(select(., c(all_of(vegetables_vec)))),
total_meat = rowSums(select(., c(all_of(all_meats)))),
total_fish = rowSums(select(., c(all_of(fish_vec)))),
total_seafood = rowSums(select(., c(all_of(seafood_vec)))),
total_herbs = rowSums(select(., c(all_of(herbs_vec)))),
total_nuts = rowSums(select(., c(all_of(nuts_vec)))),
total_fruits = rowSums(select(., c(all_of(fruits_vec)))),
total_cheese = rowSums(select(., c(all_of(cheeses_vec)))),
total_dairy= rowSums(select(., c(all_of(dairy_vec)))),
total_spices= rowSums(select(., c(all_of(spices_vec)))),
total_cereals= rowSums(select(., c(all_of(cereals_vec)))),
total_carbs = rowSums(select(., c(all_of(carbs_vec)))),
total_dessert = rowSums(select(., c(all_of(dessert_vec))))
) %>%
select(ID, title, contains("total"), everything())
4.2.1.3 Creating ingredients_df_full
We create “ingredients_df_full”, which contains the binary columns, the total columns, and the original ingredient columns.
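The difference between a “total” column and a “binary” column can be seen on a small example. This is a base R sketch on made-up data rather than the dplyr code used in this report; the ingredient columns and the `vegetables_toy` vector are hypothetical.

```r
# Toy ingredient indicators (illustrative only, column names are made up)
toy <- data.frame(
  ID     = 1:4,
  carrot = c(1, 0, 1, 0),
  onion  = c(1, 0, 0, 0),
  apple  = c(0, 1, 0, 0)
)
vegetables_toy <- c("carrot", "onion")   # hypothetical category vector

# "Total" column: number of vegetables listed in each recipe
toy$total_vegetables <- rowSums(toy[, vegetables_toy])

# "Binary" column: 1 if at least one vegetable is present, 0 otherwise
toy$vegetables_bin <- as.numeric(toy$total_vegetables > 0)

toy[, c("ID", "total_vegetables", "vegetables_bin")]
```

The binary column only records presence, while the total column also keeps how many ingredients of the category a recipe lists.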
total_join <- ingredients_df_total %>%
select(ID, contains("total"))
ingredients_df_full <- ingredients_df_bin %>%
left_join(total_join) %>%
select(ID, title, rating, contains("bin"), contains("total"), everything())
4.2.2 “Binary” engineered ingredients categories
4.2.2.1 Frequency of ingredients - binary categories
This tells us how often each ingredient category is present at least once in a recipe. As we can see, there is at least one vegetable in around 6750 recipes out of the 11380 total we have. Conversely, very few recipes contain at least one type of cereal.
#creating a vector with colnames of all the binary columns to be able to select them more easily afterwards
binary_columns <- ingredients_df_bin %>%
select(contains("bin")) %>%
colnames()
#counting the recipes that contain at least one ingredient of each binary category
total_categories <- ingredients_df_bin %>%
select(ID, all_of(binary_columns)) %>%
pivot_longer(-ID, names_to = "category", values_to = "binary_value") %>%
group_by(category) %>%
summarise(total = sum(binary_value))
#plotting the frequency of binary columns
total_categories %>%
ggplot(aes(x=reorder(category,total), y=total, fill=total)) +
geom_bar(stat = "identity") +
scale_x_discrete(guide = guide_axis(n.dodge=3))+
scale_fill_viridis() +
labs(x = "Category", y = "Amount of recipes", title = "Total amount of recipes containing at least one ingredient in defined categories")
Most of the recipes contain vegetables, fruits, meats, herbs and carbs,
which is not a surprise. The barplots below give us similar information
about the number of recipes which contain at least one ingredient in each
category. The only category for which an ingredient is present at least
once in more than 50% of the recipes is vegetables.
ingredients_df_bin %>%
select(contains("bin")) %>%
mutate(across(everything(), as.factor)) %>%
plot_bar(order_bar = FALSE)
4.2.2.2 Relationship between binary ingredients categories and 7-level rating variable
4.2.2.2.1 Barplots
We now want to check the relationship between our rating
variable and all the binary ingredients variables. We first plot these
relationships with multilevel barplots.
ingredients_df_bin %>%
select(contains("bin"), rating) %>%
mutate(across(everything(), as.factor)) %>%
plot_bar(by = "rating", order_bar = FALSE)
4.2.2.2.2 Correlation plot
We see a fairly strong negative correlation between vegetables and dessert, and between vegetables and fruits. This makes sense, as these ingredients are rarely found together in recipes. As a side note, we chose to classify tomato as a vegetable and strongly stand by this opinion :)
Concerning positive correlations, we see nuts and dessert as highly correlated. This is probably because they go well together in sugary recipes. Additionally, egg and dairy are also slightly positively correlated. This most likely comes from patisserie recipes where eggs and dairy ingredients go hand in hand.
When looking at the correlation of the binary ingredients variables
with the rating, we see that the highest negative correlation is between
rating and carbs_bin, which could sound a bit surprising
given the assumed high popularity of pasta, for example. The highest
positive correlation is with fruits_bin at 0.04. We note
however that all correlations between binary ingredients variables and
rating are disappointingly weak.
#corr plot
ingredients_df_bin %>%
select(contains("bin"), rating) %>%
plot_correlation()
# # With rating as a factor
# ingredients_df_bin %>%
# select(contains("bin"), rating) %>%
# mutate(across(rating, as.factor)) %>%
# plot_correlation()
#
# #From this graph we also notice that recipes with rating at 3.75 have slightly positive relationship with carbs and vegetables, whereas recipes with rating at 5 have slightly negative relationship with the same features. We assume then that recipes containing vegetables and carbs tend to be less appreciated.
4.2.2.3 Relationship between binary ingredients categories and binary rating variable
Given the low correlation between the “binary” ingredients categories and the 7-level rating, we decided to investigate if we could find a higher correlation by transforming the variable to a binary rating. We decide to put the threshold for a “bad” or “good” rating at 4.
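The binarisation itself is a one-line comparison. The sketch below uses made-up ratings and includes a rating of exactly 4 only to show how the strict inequality treats the boundary; whether such a value occurs in the real data is not assumed here.

```r
# Toy ratings (illustrative only, not taken from the dataset)
rating <- c(3.125, 3.75, 4, 4.375, 5)

# Strict inequality: a rating of exactly 4 is labelled "bad"
rating_bin <- ifelse(rating > 4, "good", "bad")
rating_bin

# Class balance after binarisation
rating_counts <- table(rating_bin)
rating_counts
```

Checking the class balance is useful because a very unbalanced binary variable can limit how large a correlation can be.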
ingredients_df_bin %>%
select(contains("bin"), rating) %>%
mutate(rating_bin = ifelse(rating>4, "good", "bad"), across(everything(), as.factor)) %>%
select(-rating) %>%
plot_bar(by = "rating_bin", order_bar = FALSE)
We now have only 2 categories: recipes with ratings above 4 and recipes
with ratings of 4 or below. There is no clear relationship in those graphs
either, and this confirms the correlation results that we have found
above for the 7-level rating variable.
If we look at vegetables for example, we can see that the proportion of recipes with ratings above 4 is higher for recipes containing no vegetables, when compared to recipes containing at least one vegetable.
We once again plot the correlation, but this time with the binary rating variable. The results are very similar to the 7-level rating variable, with very weak correlations.
#corr plot
ingredients_df_bin %>%
select(contains("bin"), rating) %>%
mutate(rating_bin = ifelse(rating>4, 1, 0)) %>%
select(-rating) %>%
plot_correlation()
4.2.3 “Total” engineered ingredients variables
#Analysing which single ingredients are present in the most recipes
df <- ingredients_df %>%
select(-title, -rating) %>%
pivot_longer(-ID, names_to = "ingredient", values_to = "value")
ing_top10 <- df %>%
group_by(ingredient) %>%
summarise(total = sum(value)) %>%
ungroup() %>%
arrange(desc(total)) %>%
dplyr::slice(1:10)
ing_top10 %>%
# mutate(ingredient = fct_rev(ingredient)) %>%
ggplot(aes(x=reorder(ingredient, total), y=total, fill=ingredient)) +
geom_bar(stat = "identity") +
scale_fill_viridis(discrete = TRUE) +
scale_x_discrete(guide = guide_axis(n.dodge=2))+
labs(x = "Ingredient", y = "Value", title = "Total amount of recipes containing each ingredient\nTop 10")
As we can observe, many recipes contain milk cream, onion, tomato,
salad, egg and garlic. Some of these are versatile, as they can be used
in many different kinds of recipes. Onion and garlic are widely used
to give flavour to many dishes, whereas egg and milk cream can be
used in both savoury and sweet recipes.
4.2.3.1 Correlation between total number of ingredients per category and 7-level rating variable
ingredients_df_total %>%
select(contains("total"), rating) %>%
plot_correlation()
We notice a negative relationship between the number of fruits and the
number of vegetables, which makes sense, as these two kinds of ingredients
are rarely combined in a recipe. The same is true for the number of
vegetables and the number of desserts. The results are similar to
the previous correlogram with the binary columns. In this case we do not
notice any significant relationship between rating and the other
variables.
4.2.3.2 Correlation between total number of ingredients per category and binary rating variable
Results are disappointing again and in line with what we have found so far. Correlations between the total number of ingredients and the binary rating variable are also very weak.
#corr plot
ingredients_df_total %>%
select(contains("total"), rating) %>%
mutate(rating_bin = ifelse(rating>4, 1, 0)) %>%
select(-rating) %>%
plot_correlation()
4.2.3.3 Amount of ingredients per recipe
The number of ingredients per recipe is more or less normally distributed, with a mean around 4.75.
#checking some stuff about the new ingredients table
ingredients_df_total %>%
select(ID, title, total_ingredients) %>%
ggplot(aes(x=total_ingredients)) +
geom_bar() + geom_vline(aes(xintercept=mean(total_ingredients)),color="red", linetype="dashed", size=1)+
scale_x_continuous(breaks = seq(1, 12, by = 1)) +
labs(x="Number of ingredients per recipe", y = "Recipe Count", title = "Distribution of number of ingredients per recipe")
#we notice that 117 (no NAs and RAT0, and not duplicated) recipes have 0 ingredients, let's investigate why and how that's possible
ingredients_df_total %>%
filter(total_ingredients==0)
#let's pick recipe ID number 1183 which should have poppy and sesame seeds according to the title
recipes %>%
filter(ID == 1183) %>%
select_if(~ any(. == 1))
#we can see that only 3 variables are equal to 1 here
recipes %>%
filter(ID == 365) %>%
select_if(~ any(. == 1))
recipes %>%
filter(ID == 1089) %>%
select_if(~ any(. == 1))
#####
#QUESTION: do we want to keep those recipes?
#####
ingredients_df_total %>%
filter(total_ingredients >10)
Based on this information, we decide to eliminate those 117 observations which don’t contain any ingredients that we have classified in our vectors.
#eliminating all recipes which do not contain any ingredients that we have classified in categories --> within the "all_ingredients" vector
index_zero_ingredient <- ingredients_df_full %>%
filter(total_ingredients == 0) %>% pull(ID)
#removing the 117 recipes with no ingredients listed from recipes and ingredients_df_full
recipes <- recipes %>%
filter(!ID %in% index_zero_ingredient)
ingredients_df_full <- ingredients_df_full %>%
filter(!total_ingredients == 0)
Besides the total amount of ingredients, let’s check the amount of ingredients per recipe for the top 3 categories in terms of ingredients frequency (i.e., vegetables, fruits, meats).
show_nveg <- ingredients_df_full %>%
filter(vegetables_bin == 1) %>%
select(ID, title, total_vegetables) %>%
ggplot(aes(x=total_vegetables)) +
scale_x_continuous(breaks = seq(1, 9, by = 1)) +
geom_bar() + geom_vline(aes(xintercept=mean(total_vegetables)),color="blue", linetype="dashed", size=1)
show_nfruit <- ingredients_df_full %>%
filter(fruits_bin == 1) %>%
select(ID, title, total_fruits) %>%
ggplot(aes(x=total_fruits)) +
scale_x_continuous(breaks = seq(1, 9, by = 1)) +
geom_bar() + geom_vline(aes(xintercept=mean(total_fruits)),color="blue", linetype="dashed", size=1)
# ingredients_df_full %>%
# select(ID, title, total_meat) %>%
# ggplot(aes(x=total_meat)) +
# geom_bar() + geom_vline(aes(xintercept=mean(total_meat)),color="blue", linetype="dashed", size=1)+
# labs(x="Number of meats per recipe", y = "Recipe Count", title = "Distrubution of number of meats per recipe")
#let's try to filter by recipes which contain meat to see if my functions work
show_nmeat <- ingredients_df_full %>%
filter(meats_bin == 1) %>%
select(ID, title, total_meat) %>%
ggplot(aes(x=total_meat)) +
geom_bar() + geom_vline(aes(xintercept=mean(total_meat)),color="blue", linetype="dashed", size=1)+
labs(x="Number of meats per recipe", y = "Recipe Count", title = "Distribution of number of meats per recipe")
#why do we still have value in 0 meats --> it was because when creating the total meat column in ingredients_df_test I did not include the general meat category
grid.arrange(show_nveg, show_nfruit, show_nmeat, ncol=2, nrow =2)
As we can observe, most of the recipes have 1 or 2 vegetables and rarely
more than 4. The same is true for fruits. Concerning meat, it is usual
to see one kind of meat, but rare to see 3 or more in the same
recipe.
4.3 Mixed EDA - ingredients and nutritional value
recipes_select <- recipes %>%
select(ID, title, rating, calories, protein, sodium, fat)
ingredients_select <- ingredients_df_total %>%
select(ID, contains("total"))
recipes_more <- recipes_select %>%
left_join(ingredients_select,
by=c('ID'))
ingredients_bin_select <- ingredients_df_bin %>%
select(ID, contains("bin"))
recipes_full <- recipes_more %>%
left_join(ingredients_bin_select,
by=c('ID'))
4.3.1 Correlation between nutritional values and engineered ingredients variables
recipes_full %>%
select(-ID) %>%
plot_correlation()
For instance we notice a positive correlation between proteins and
meats_bin which includes all sorts of meat. Another visible positive
correlation is the one between sodium and seafood_bin. We might also
want to investigate the relationship between calories and carbs_bin.
4.3.2 Barplot and boxplot - Meat and Proteins
# Barplot
barplot1 <- recipes_full %>%
ggplot(aes(x = factor(meats_bin), y = protein)) +
stat_summary(fun = mean, geom = "bar") +
ggtitle("Average amount of proteins per recipe with and without meat") +
xlab("Presence of Meat or not") +
ylab("Protein Content in grams")
# Boxplots per different kinds of meat
recipes_general <- recipes_full %>%
select(ID) %>%
left_join(recipes,
by=c('ID'))
recipes_meat <- recipes_general %>%
select(ID, title, rating, calories, protein, fat, sodium, all_of(all_meats))
recipes_meat <- recipes_meat %>%
pivot_longer(cols=c("beef", "beef_rib", "beef_shank", "beef_tenderloin", "brisket", "ground_beef", "hamburger", "veal", "bacon", "ham", "pork", "pork_chop", "pork_rib", "pork_tenderloin", "prosciutto", "ground_lamb", "lamb", "lamb_chop", "lamb_shank", "rack_of_lamb", "chicken", "duck", "goose", "poultry", "poultry_sausage", "quail", "turkey", "meatball", "meatloaf", "rabbit", "sausage", "steak", "venison" ),
names_to='meats',
values_to='yes_or_no') %>%
filter(yes_or_no == 1)
multi_boxplot1 <- recipes_meat %>%
filter(protein < 450) %>%
ggplot(aes(x=meats, y=protein, fill=meats)) +
geom_boxplot(alpha=0.3) +
scale_y_continuous(breaks=seq(0,7000,25)) +
coord_flip() +
ggtitle("Distribution of proteins per recipe according to different kinds of meat") +
xlab("Meats") +
ylab("Protein Content in grams") +
theme(legend.position="none")
# Here we want to show which kinds of meat specifically have a high level of proteins
grid.arrange(barplot1, multi_boxplot1, ncol=2, nrow =1)
It is very clear that among the recipes with meat, the average content
of protein is higher than in recipes without meat. Among those that have
meat we notice that goose, venison and lamb shank have the highest
content of protein.
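The bar heights in such a comparison are simply group means, which can also be computed directly. The sketch below uses made-up protein values and adds the group medians as a complementary check; the numbers are illustrative and do not come from the dataset.

```r
# Toy data (illustrative only): protein in grams and a meat indicator
toy <- data.frame(
  meats_bin = c(0, 0, 0, 1, 1, 1),
  protein   = c(5, 12, 8, 40, 35, 60)
)

# Mean protein per group, the quantity the bar heights represent
mean_protein <- tapply(toy$protein, toy$meats_bin, mean)
mean_protein

# Medians are less sensitive to a few very protein-rich recipes
median_protein <- tapply(toy$protein, toy$meats_bin, median)
median_protein
```

Comparing means and medians shows whether a difference between groups is driven by a handful of extreme recipes.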
4.3.3 Barplot and boxplot - Seafood and Sodium
# Seafood and sodium
barplot2 <- recipes_full %>%
ggplot(aes(x = factor(seafood_bin), y = sodium)) +
stat_summary(fun = mean, geom = "bar") +
ggtitle("Average amount of sodium per recipe with and without seafood") +
xlab("Presence of Seafood or not") +
ylab("Sodium Content in milligrams")
# Boxplots per different kinds of seafood
recipes_seafood <- recipes_general %>%
select(ID, title, rating, calories, protein, fat, sodium, all_of(seafood_vec))
recipes_seafood <- recipes_seafood %>%
pivot_longer(cols=c("clam", "crab", "lobster", "mussel", "octopus", "oyster", "scallop", "shellfish", "shrimp", "squid" ),
names_to='seafoods',
values_to='yes_or_no') %>%
filter(yes_or_no == 1)
multi_boxplot2 <- recipes_seafood %>%
filter(sodium < 10000) %>%
ggplot(aes(x=seafoods, y=sodium, fill=seafoods)) +
geom_boxplot(alpha=0.3) +
scale_y_continuous(breaks=seq(0,30000,500)) +
coord_flip() +
ggtitle("Distribution of sodium per recipe according to different kinds of seafood") +
xlab("Seafood") +
ylab("Sodium Content in milligrams") +
theme(legend.position="none")
# Here we want to show which kinds of seafood specifically have a high level of sodium
grid.arrange(barplot2, multi_boxplot2, ncol=2, nrow =1)
Here too, recipes with seafood clearly have a higher average sodium
content than recipes without seafood. Among the seafood recipes, those
with clams and lobsters have the highest sodium content.
4.3.4 Barplot and boxplot - Carbs and Calories
# Carbs and calories
barplot3 <- recipes_full %>%
ggplot(aes(x = factor(carbs_bin), y = calories)) +
stat_summary(fun = mean, geom = "bar") +
ggtitle("Average amount of calories per recipe with and without carbohydrates") +
xlab("Presence of carbohydrates or not") +
ylab("Calories content")
# Afterwards we would also want to show which kinds of carbs specifically have a high number of calories
# Boxplots per different kinds of carbs
recipes_carbs <- recipes_general %>%
select(ID, title, rating, calories, protein, fat, sodium, all_of(carbs_vec))
recipes_carbs <- recipes_carbs %>%
pivot_longer(cols=c("brown_rice", "chickpea", "cornmeal", "couscous", "hominy_cornmeal_masa", "orzo", "pasta", "potato", "rice", "semolina", "sweet_potato_yam", "wild_rice"),
names_to='carbs',
values_to='yes_or_no') %>%
filter(yes_or_no == 1)
multi_boxplot3 <- recipes_carbs %>%
# NOTE: this filter is on sodium (as in the seafood plot), although calories are plotted here
filter(sodium < 10000) %>%
ggplot(aes(x=carbs, y=calories, fill=carbs)) +
geom_boxplot(alpha=0.3) +
scale_y_continuous(breaks=seq(0,7000,500)) +
coord_flip() +
ggtitle("Distribution of calories per recipe according to different kinds of food high in carbohydrates") +
xlab("Carbs") +
ylab("Calories Content") +
theme(legend.position="none")
grid.arrange(barplot3, multi_boxplot3, ncol=2, nrow =1)
Recipes that contain carbs have a higher average calorie content than
recipes without carbs. Among the recipes with carbs, those with pasta
and chickpeas are the richest in calories.
4.4 Exploratory PCA Analysis
- With all variables (679 once ID and title are dropped)
recipes_pca <- recipes %>%
select(-ID, -title) %>%
PCA(ncp = 679, graph = FALSE)
recipes_pca
#> **Results for the Principal Component Analysis (PCA)**
#> The analysis was performed on 10163 individuals, described by 679 variables
#> *The results are available in the following objects:
#>
#> name description
#> 1 "$eig" "eigenvalues"
#> 2 "$var" "results for the variables"
#> 3 "$var$coord" "coord. for the variables"
#> 4 "$var$cor" "correlations variables - dimensions"
#> 5 "$var$cos2" "cos2 for the variables"
#> 6 "$var$contrib" "contributions of the variables"
#> 7 "$ind" "results for the individuals"
#> 8 "$ind$coord" "coord. for the individuals"
#> 9 "$ind$cos2" "cos2 for the individuals"
#> 10 "$ind$contrib" "contributions of the individuals"
#> 11 "$call" "summary statistics"
#> 12 "$call$centre" "mean of the variables"
#> 13 "$call$ecart.type" "standard error of the variables"
#> 14 "$call$row.w" "weights for the individuals"
#> 15 "$call$col.w" "weights for the variables"
fviz_pca_var(recipes_pca)
This PCA output is hard to interpret: together, the first two dimensions
explain only 2.1% of the variability in the data.
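To make the "percentage of variability" figure concrete, here is a minimal sketch of how it is derived from the eigenvalues. It uses base R `prcomp()` on invented toy data rather than the `PCA()` call from the report, so the object names (`toy`, `toy_pca`, `pct_var`) are assumptions made for illustration only.

```r
# Toy data: 50 observations of 6 independent binary variables.
# This mimics many weakly related 0/1 columns; it is not the recipes data.
set.seed(1)
toy <- as.data.frame(matrix(rbinom(50 * 6, size = 1, prob = 0.3), ncol = 6))

# PCA on standardised variables (FactoMineR::PCA also scales by default).
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components.
eigenvalues <- toy_pca$sdev^2
pct_var <- 100 * eigenvalues / sum(eigenvalues)

# Share of variance captured by the first two dimensions together.
print(sum(pct_var[1:2]))
```

With standardised variables each column contributes one unit of variance, so for 679 variables a combined share of 2.1% would correspond to the first two eigenvalues summing to roughly 14 out of 679.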
fviz_contrib(recipes_pca, choice = "var", axes = 1)
fviz_contrib(recipes_pca, choice = "var", axes = 2)
Here too, it is difficult to tell which variables contribute to each
dimension, since each dimension accounts for only a small percentage of
the variability.
fviz_pca_biplot(recipes_pca) ## biplot
fviz_eig(recipes_pca, addlabels = TRUE, ncp=11)
recipe_hc <- hclust(dist(recipes[,-c(1,2)], method = "manhattan"))
recipe_clust <- cutree(recipe_hc, k = 10)
fviz_pca_biplot(recipes_pca,
col.ind = factor(recipe_clust))
4.4.1 Exploratory PCA Analysis
- With only the ones we believe could be useful
nutritional_df <- recipes %>%
select(ID, all_of(nutritional_values))
###### CAREFUL --> recipes_analysis should be of dim 10163 x 33
recipes_analysis <- ingredients_df_full %>%
left_join(nutritional_df, by="ID") %>%
mutate(ID = as.character(ID)) %>%
select(rating, all_of(nutritional_values), contains("bin"), contains("total"))
recipes_pca2 <- PCA(recipes_analysis, ncp = 33, graph = FALSE)
recipes_pca2
#> **Results for the Principal Component Analysis (PCA)**
#> The analysis was performed on 10163 individuals, described by 33 variables
#> *The results are available in the following objects:
#>
#> name description
#> 1 "$eig" "eigenvalues"
#> 2 "$var" "results for the variables"
#> 3 "$var$coord" "coord. for the variables"
#> 4 "$var$cor" "correlations variables - dimensions"
#> 5 "$var$cos2" "cos2 for the variables"
#> 6 "$var$contrib" "contributions of the variables"
#> 7 "$ind" "results for the individuals"
#> 8 "$ind$coord" "coord. for the individuals"
#> 9 "$ind$cos2" "cos2 for the individuals"
#> 10 "$ind$contrib" "contributions of the individuals"
#> 11 "$call" "summary statistics"
#> 12 "$call$centre" "mean of the variables"
#> 13 "$call$ecart.type" "standard error of the variables"
#> 14 "$call$row.w" "weights for the individuals"
#> 15 "$call$col.w" "weights for the variables"
fviz_pca_var(recipes_pca2)
fviz_contrib(recipes_pca2, choice = "var", axes = 1)
fviz_contrib(recipes_pca2, choice = "var", axes = 2)
fviz_pca_biplot(recipes_pca2) ## biplot
fviz_eig(recipes_pca2, addlabels = TRUE, ncp=11)
p1 <- fviz_pca_biplot(recipes_pca2, axes = 1:2)
p2 <- fviz_pca_biplot(recipes_pca2, axes = 3:4)
p3 <- fviz_pca_biplot(recipes_pca2, axes = 5:6)
p4 <- fviz_pca_biplot(recipes_pca2, axes = 7:8)
p5 <- fviz_pca_biplot(recipes_pca2, axes = 9:10)
grid.arrange(p1, p2, p3, p4, p5, nrow = 3, ncol=2)
recipe_hc2 <- hclust(dist(recipes_analysis, method = "manhattan"))
recipe_clust2 <- cutree(recipe_hc2, k = 10)
fviz_pca_biplot(recipes_pca2,
col.ind = factor(recipe_clust2))
4.5 Seasons, Recipe Type, and Countries EDA
4.5.1 Seasons
#Create seasons df
seasons_df <- recipes %>%
select(ID, rating, all_of(seasons_vec)) %>%
filter(if_any(all_of(seasons_vec), ~ .x == 1)) %>%
mutate(sum_season = rowSums(across(all_of(seasons_vec))))
seasons_df %>%
ggplot(aes(x=sum_season)) +
geom_bar()
seasons_df %>%
filter(sum_season==3)
#> ID rating winter spring summer fall sum_season
#> 1 1271 4.375 0 1 1 1 3
#> 2 1430 4.375 1 1 0 1 3
#> 3 2672 4.375 0 1 1 1 3
#> 4 2917 3.750 1 1 0 1 3
#> 5 4180 4.375 1 1 0 1 3
#> 6 7740 4.375 1 1 0 1 3
#> 7 8045 4.375 0 1 1 1 3
#> 8 10782 4.375 0 1 1 1 3
#> 9 12421 4.375 1 1 0 1 3
#> 10 12545 5.000 1 1 1 0 3
#> 11 15494 4.375 1 0 1 1 3
#> 12 16603 3.750 1 0 1 1 3
#> 13 16729 4.375 0 1 1 1 3
#> 14 17102 3.750 1 1 0 1 3
#> 15 18603 4.375 1 1 0 1 3
#> 16 18854 5.000 1 1 0 1 3
#> 17 18908 5.000 1 1 1 0 3
#> 18 19913 3.750 1 0 1 1 3
seasons_df %>%
filter(sum_season==4)
#> ID rating winter spring summer fall sum_season
#> 1 54 5.000 1 1 1 1 4
#> 2 1837 3.750 1 1 1 1 4
#> 3 4733 3.125 1 1 1 1 4
#> 4 4980 3.750 1 1 1 1 4
#> 5 7341 1.250 1 1 1 1 4
#> 6 8593 5.000 1 1 1 1 4
#> 7 13759 4.375 1 1 1 1 4
#> 8 14419 3.750 1 1 1 1 4
#> 9 16448 2.500 1 1 1 1 4
#> 10 19011 3.750 1 1 1 1 4
#total of 28 recipes with either 3 or 4 seasons (18 + 10 in the outputs above) --> let's discard them
#let's look more closely at those with 2 seasons to see whether the two seasons are adjacent or not
seasons_df %>%
filter(sum_season==2)
#> ID rating winter spring summer fall sum_season
#> 1 25 3.750 1 0 0 1 2
#> 2 27 3.750 0 1 1 0 2
#> 3 57 4.375 1 0 0 1 2
#> 4 67 3.125 1 0 0 1 2
#> 5 130 4.375 1 0 0 1 2
#> 6 135 3.750 1 0 0 1 2
#> 7 153 4.375 0 0 1 1 2
#> 8 156 4.375 1 0 0 1 2
#> 9 167 4.375 1 0 0 1 2
#> 10 179 4.375 1 0 0 1 2
#> 11 238 3.750 0 0 1 1 2
#> 12 256 3.750 1 0 0 1 2
#> 13 275 3.750 0 1 1 0 2
#> 14 278 4.375 0 1 1 0 2
#> 15 368 4.375 1 0 0 1 2
#> 16 369 4.375 0 1 1 0 2
#> 17 401 3.750 1 0 0 1 2
#> 18 449 3.750 0 0 1 1 2
#> 19 501 4.375 0 1 1 0 2
#> 20 516 3.125 1 0 0 1 2
#> 21 538 5.000 1 0 0 1 2
#> 22 543 3.125 0 1 1 0 2
#> 23 558 4.375 1 0 0 1 2
#> 24 597 4.375 1 0 0 1 2
#> 25 599 4.375 1 0 0 1 2
#> 26 614 4.375 1 1 0 0 2
#> 27 618 4.375 1 0 0 1 2
#> 28 639 3.750 1 0 0 1 2
#> 29 664 3.750 1 0 0 1 2
#> 30 666 5.000 1 0 0 1 2
#> 31 682 4.375 1 1 0 0 2
#> 32 697 5.000 0 1 1 0 2
#> 33 700 4.375 0 1 1 0 2
#> 34 738 3.750 1 0 0 1 2
#> 35 802 3.750 0 0 1 1 2
#> 36 830 3.750 1 0 0 1 2
#> 37 864 5.000 1 0 0 1 2
#> 38 866 4.375 1 0 0 1 2
#> 39 890 3.125 1 0 0 1 2
#> 40 900 4.375 1 0 0 1 2
#> 41 902 4.375 1 0 0 1 2
#> 42 903 3.750 1 0 0 1 2
#> 43 917 4.375 1 0 0 1 2
#> 44 920 2.500 0 1 1 0 2
#> 45 956 4.375 1 0 0 1 2
#> 46 964 4.375 0 1 1 0 2
#> 47 986 4.375 1 0 0 1 2
#> 48 1003 4.375 1 1 0 0 2
#> 49 1060 4.375 1 0 0 1 2
#> 50 1106 4.375 0 1 1 0 2
#> 51 1134 4.375 1 0 0 1 2
#> 52 1179 3.125 1 0 0 1 2
#> 53 1207 4.375 0 1 1 0 2
#> 54 1212 4.375 0 1 1 0 2
#> 55 1268 4.375 0 1 1 0 2
#> 56 1275 3.750 0 1 1 0 2
#> 57 1290 4.375 1 0 0 1 2
#> 58 1308 3.125 0 0 1 1 2
#> 59 1345 4.375 0 0 1 1 2
#> 60 1346 3.125 1 0 0 1 2
#> 61 1405 5.000 1 0 0 1 2
#> 62 1406 3.750 1 0 0 1 2
#> 63 1411 4.375 1 0 0 1 2
#> 64 1459 3.750 1 0 0 1 2
#> 65 1507 4.375 0 1 1 0 2
#> 66 1528 4.375 0 1 1 0 2
#> 67 1581 5.000 1 0 0 1 2
#> 68 1602 5.000 1 1 0 0 2
#> 69 1648 3.750 0 1 1 0 2
#> 70 1688 3.125 0 1 0 1 2
#> 71 1732 3.750 0 0 1 1 2
#> 72 1744 3.750 1 0 0 1 2
#> 73 1757 3.750 1 0 0 1 2
#> 74 1794 5.000 1 1 0 0 2
#> 75 1797 3.125 1 0 0 1 2
#> 76 1805 4.375 0 1 1 0 2
#> 77 1806 4.375 1 1 0 0 2
#> 78 1832 3.750 1 0 0 1 2
#> 79 1846 5.000 1 0 0 1 2
#> 80 1895 3.750 1 0 0 1 2
#> 81 1932 4.375 1 0 0 1 2
#> 82 1940 4.375 1 0 0 1 2
#> 83 2020 4.375 0 1 1 0 2
#> 84 2035 5.000 0 1 1 0 2
#> 85 2069 5.000 1 0 0 1 2
#> 86 2088 4.375 0 1 1 0 2
#> 87 2093 4.375 1 0 0 1 2
#> 88 2117 4.375 1 0 0 1 2
#> 89 2137 4.375 1 0 0 1 2
#> 90 2207 4.375 0 0 1 1 2
#> 91 2241 3.750 1 0 0 1 2
#> 92 2243 5.000 1 0 0 1 2
#> 93 2245 4.375 1 0 0 1 2
#> 94 2261 5.000 0 1 1 0 2
#> 95 2269 4.375 1 0 0 1 2
#> 96 2328 4.375 1 0 0 1 2
#> 97 2331 4.375 1 0 0 1 2
#> 98 2380 3.750 1 0 0 1 2
#> 99 2417 3.750 0 1 1 0 2
#> 100 2518 3.750 1 0 0 1 2
#> 101 2610 3.750 0 1 1 0 2
#> 102 2620 4.375 1 0 0 1 2
#> 103 2694 3.125 1 0 0 1 2
#> 104 2717 4.375 1 0 0 1 2
#> 105 2727 5.000 0 0 1 1 2
#> 106 2733 3.750 1 0 0 1 2
#> 107 2746 5.000 0 1 1 0 2
#> 108 2756 3.125 1 0 0 1 2
#> 109 2769 4.375 1 0 0 1 2
#> 110 2778 4.375 0 1 1 0 2
#> 111 2825 5.000 1 0 0 1 2
#> 112 2835 3.750 1 0 0 1 2
#> 113 2838 3.750 0 1 1 0 2
#> 114 2842 3.750 1 0 0 1 2
#> 115 2898 3.750 1 0 0 1 2
#> 116 2938 4.375 0 1 1 0 2
#> 117 2952 4.375 0 1 1 0 2
#> 118 2963 5.000 1 0 0 1 2
#> 119 2965 5.000 1 1 0 0 2
#> 120 2966 3.750 1 0 0 1 2
#> 121 3007 3.125 1 0 0 1 2
#> 122 3010 4.375 0 1 0 1 2
#> 123 3049 4.375 1 0 0 1 2
#> 124 3076 4.375 1 0 0 1 2
#> 125 3149 4.375 0 1 1 0 2
#> 126 3172 3.750 1 0 0 1 2
#> 127 3212 4.375 1 0 0 1 2
#> 128 3219 3.750 0 1 1 0 2
#> 129 3233 3.750 1 0 0 1 2
#> 130 3238 5.000 0 1 1 0 2
#> 131 3257 5.000 1 0 0 1 2
#> 132 3310 4.375 1 0 0 1 2
#> 133 3369 4.375 1 0 0 1 2
#> 134 3402 3.750 1 0 0 1 2
#> 135 3406 4.375 1 0 0 1 2
#> 136 3417 4.375 0 1 1 0 2
#> 137 3468 3.125 0 1 1 0 2
#> 138 3548 4.375 1 0 0 1 2
#> 139 3562 4.375 1 0 0 1 2
#> 140 3569 3.750 1 0 0 1 2
#> 141 3600 5.000 1 0 0 1 2
#> 142 3618 5.000 0 1 1 0 2
#> 143 3663 4.375 1 0 0 1 2
#> 144 3698 4.375 1 0 0 1 2
#> 145 3734 4.375 1 0 1 0 2
#> 146 3773 3.750 1 0 0 1 2
#> 147 3881 4.375 0 1 1 0 2
#> 148 3926 4.375 1 0 0 1 2
#> 149 3945 4.375 1 0 0 1 2
#> 150 3985 4.375 1 0 0 1 2
#> 151 4016 3.750 0 1 1 0 2
#> 152 4025 4.375 1 0 0 1 2
#> 153 4040 4.375 0 1 1 0 2
#> 154 4056 4.375 0 1 1 0 2
#> 155 4064 4.375 1 0 0 1 2
#> 156 4074 4.375 0 1 0 1 2
#> 157 4195 4.375 0 1 1 0 2
#> 158 4204 4.375 1 0 0 1 2
#> 159 4241 5.000 0 1 0 1 2
#> 160 4377 4.375 1 0 0 1 2
#> 161 4414 3.125 1 0 0 1 2
#> 162 4473 4.375 1 0 0 1 2
#> 163 4510 4.375 0 0 1 1 2
#> 164 4539 4.375 1 0 0 1 2
#> 165 4552 4.375 1 0 0 1 2
#> 166 4637 4.375 1 0 0 1 2
#> 167 4682 4.375 1 0 0 1 2
#> 168 4737 4.375 1 0 0 1 2
#> 169 4788 4.375 1 0 0 1 2
#> 170 4839 4.375 1 0 0 1 2
#> 171 4887 3.750 1 0 0 1 2
#> 172 4946 4.375 1 0 0 1 2
#> 173 4966 4.375 0 1 1 0 2
#> 174 5149 5.000 0 1 1 0 2
#> 175 5151 4.375 1 0 0 1 2
#> 176 5216 3.750 1 0 0 1 2
#> 177 5235 4.375 0 1 1 0 2
#> 178 5373 4.375 1 0 0 1 2
#> 179 5407 1.875 1 0 0 1 2
#> 180 5410 5.000 1 0 1 0 2
#> 181 5450 3.750 1 0 0 1 2
#> 182 5451 4.375 0 1 1 0 2
#> 183 5485 4.375 0 1 1 0 2
#> 184 5516 3.750 1 1 0 0 2
#> 185 5537 2.500 0 1 1 0 2
#> 186 5597 4.375 1 0 0 1 2
#> 187 5613 3.125 0 0 1 1 2
#> 188 5622 3.750 1 0 0 1 2
#> 189 5655 4.375 1 0 0 1 2
#> 190 5660 4.375 1 1 0 0 2
#> 191 5689 4.375 1 0 0 1 2
#> 192 5757 4.375 0 0 1 1 2
#> 193 5775 1.875 1 0 0 1 2
#> 194 5778 3.750 1 0 0 1 2
#> 195 5820 3.750 1 0 1 0 2
#> 196 5880 5.000 0 1 1 0 2
#> 197 5905 1.875 1 0 0 1 2
#> 198 5975 4.375 1 0 0 1 2
#> 199 5996 4.375 1 0 0 1 2
#> 200 6006 4.375 1 0 0 1 2
#> 201 6053 4.375 0 1 0 1 2
#> 202 6124 4.375 1 0 1 0 2
#> 203 6148 4.375 1 0 0 1 2
#> 204 6150 4.375 1 0 0 1 2
#> 205 6155 3.750 0 1 1 0 2
#> 206 6175 3.750 1 0 0 1 2
#> 207 6235 3.125 1 0 0 1 2
#> 208 6238 4.375 1 0 0 1 2
#> 209 6241 5.000 0 1 1 0 2
#> 210 6260 3.750 0 1 0 1 2
#> 211 6310 4.375 1 0 0 1 2
#> 212 6315 4.375 1 0 0 1 2
#> 213 6358 4.375 0 0 1 1 2
#> 214 6373 3.125 0 1 1 0 2
#> 215 6422 5.000 0 0 1 1 2
#> 216 6426 4.375 1 0 0 1 2
#> 217 6432 5.000 0 1 1 0 2
#> 218 6467 4.375 1 0 0 1 2
#> 219 6479 4.375 0 1 1 0 2
#> 220 6515 4.375 1 0 0 1 2
#> 221 6564 3.750 0 1 1 0 2
#> 222 6569 4.375 1 0 0 1 2
#> 223 6573 4.375 1 0 0 1 2
#> 224 6614 3.750 0 0 1 1 2
#> 225 6648 5.000 1 0 0 1 2
#> 226 6650 5.000 1 0 0 1 2
#> 227 6704 4.375 1 0 0 1 2
#> 228 6709 3.750 1 0 0 1 2
#> 229 6780 4.375 0 1 1 0 2
#> 230 6825 4.375 1 0 0 1 2
#> 231 6847 4.375 0 1 1 0 2
#> 232 7121 3.750 1 0 0 1 2
#> 233 7140 4.375 1 0 0 1 2
#> 234 7144 3.750 0 1 1 0 2
#> 235 7282 4.375 0 1 1 0 2
#> 236 7285 3.750 1 0 0 1 2
#> 237 7310 3.750 1 0 0 1 2
#> 238 7326 4.375 1 0 0 1 2
#> 239 7333 3.125 0 1 0 1 2
#> 240 7397 5.000 1 0 0 1 2
#> 241 7480 5.000 0 1 1 0 2
#> 242 7482 4.375 0 1 1 0 2
#> 243 7532 5.000 0 1 1 0 2
#> 244 7557 4.375 1 0 0 1 2
#> 245 7565 3.750 1 0 0 1 2
#> 246 7586 4.375 1 0 0 1 2
#> 247 7754 3.125 1 0 0 1 2
#> 248 7782 4.375 1 0 0 1 2
#> 249 7803 4.375 1 0 0 1 2
#> 250 7815 4.375 0 0 1 1 2
#> 251 7829 4.375 0 1 1 0 2
#> 252 7871 4.375 1 0 0 1 2
#> 253 7898 5.000 0 1 1 0 2
#> 254 7997 4.375 0 0 1 1 2
#> 255 8019 3.750 1 0 0 1 2
#> 256 8021 3.750 1 0 0 1 2
#> 257 8105 4.375 0 1 1 0 2
#> 258 8106 5.000 1 0 0 1 2
#> 259 8107 4.375 0 1 1 0 2
#> 260 8218 4.375 1 0 0 1 2
#> 261 8228 4.375 1 0 0 1 2
#> 262 8253 4.375 0 1 1 0 2
#> 263 8266 3.750 0 1 1 0 2
#> 264 8270 4.375 0 1 1 0 2
#> 265 8312 3.750 1 0 0 1 2
#> 266 8417 4.375 1 0 0 1 2
#> 267 8442 5.000 0 1 1 0 2
#> 268 8459 3.750 0 1 1 0 2
#> 269 8461 4.375 0 1 1 0 2
#> 270 8641 3.750 0 1 1 0 2
#> 271 8693 3.750 1 1 0 0 2
#> 272 8705 3.750 0 0 1 1 2
#> 273 8717 4.375 1 0 0 1 2
#> 274 8728 4.375 1 0 0 1 2
#> 275 8740 4.375 0 1 1 0 2
#> 276 8792 4.375 0 1 1 0 2
#> 277 8839 4.375 1 0 0 1 2
#> 278 8856 4.375 1 0 0 1 2
#> 279 8923 4.375 1 0 0 1 2
#> 280 8942 4.375 1 0 0 1 2
#> 281 8977 3.750 1 0 0 1 2
#> 282 8989 4.375 1 0 0 1 2
#> 283 8994 4.375 1 0 0 1 2
#> 284 9000 4.375 1 0 0 1 2
#> 285 9054 3.125 0 1 1 0 2
#> 286 9055 3.750 0 1 0 1 2
#> 287 9080 4.375 1 0 0 1 2
#> 288 9081 3.750 0 1 1 0 2
#> 289 9087 4.375 1 1 0 0 2
#> 290 9088 4.375 0 1 1 0 2
#> 291 9092 3.750 1 0 0 1 2
#> 292 9141 4.375 1 0 0 1 2
#> 293 9213 3.125 1 1 0 0 2
#> 294 9238 4.375 1 0 0 1 2
#> 295 9264 4.375 0 1 1 0 2
#> 296 9277 4.375 1 0 0 1 2
#> 297 9333 3.750 1 0 0 1 2
#> 298 9345 4.375 0 0 1 1 2
#> 299 9354 4.375 0 1 1 0 2
#> 300 9365 3.750 0 1 1 0 2
#> 301 9378 4.375 1 0 0 1 2
#> 302 9399 3.750 0 1 1 0 2
#> 303 9514 4.375 1 0 0 1 2
#> 304 9523 4.375 1 0 0 1 2
#> 305 9533 5.000 1 0 0 1 2
#> 306 9582 4.375 1 0 0 1 2
#> 307 9636 4.375 0 1 1 0 2
#> 308 9638 3.750 1 0 0 1 2
#> 309 9692 5.000 1 0 0 1 2
#> 310 9701 3.750 1 0 0 1 2
#> 311 9775 5.000 1 0 0 1 2
#> 312 9790 4.375 1 0 0 1 2
#> 313 9879 3.125 1 0 0 1 2
#> 314 9909 3.750 0 1 1 0 2
#> 315 9930 5.000 0 1 1 0 2
#> 316 9961 4.375 0 1 0 1 2
#> 317 9983 5.000 1 0 0 1 2
#> 318 10009 3.125 1 0 0 1 2
#> 319 10016 4.375 1 0 0 1 2
#> 320 10017 4.375 0 0 1 1 2
#> 321 10043 4.375 1 0 0 1 2
#> 322 10045 3.750 1 0 0 1 2
#> 323 10063 3.750 1 0 0 1 2
#> 324 10067 3.750 1 0 0 1 2
#> 325 10069 4.375 0 1 1 0 2
#> 326 10118 5.000 1 0 0 1 2
#> 327 10153 4.375 0 1 1 0 2
#> 328 10155 4.375 1 1 0 0 2
#> 329 10216 3.750 1 0 0 1 2
#> 330 10257 4.375 1 0 0 1 2
#> 331 10262 3.125 1 1 0 0 2
#> 332 10276 3.750 1 0 0 1 2
#> 333 10307 3.750 1 0 0 1 2
#> 334 10318 3.750 1 0 0 1 2
#> 335 10339 3.750 1 0 0 1 2
#> 336 10389 4.375 1 0 0 1 2
#> 337 10400 5.000 0 1 1 0 2
#> 338 10424 3.750 1 0 0 1 2
#> 339 10436 5.000 1 0 0 1 2
#> 340 10442 4.375 1 0 0 1 2
#> 341 10499 4.375 0 1 1 0 2
#> 342 10505 5.000 1 0 0 1 2
#> 343 10511 4.375 1 0 1 0 2
#> 344 10594 5.000 1 0 0 1 2
#> 345 10756 3.125 1 0 0 1 2
#> 346 10805 4.375 1 0 0 1 2
#> 347 10881 3.750 0 1 1 0 2
#> 348 10923 4.375 1 0 0 1 2
#> 349 10939 5.000 1 0 0 1 2
#> 350 10968 5.000 1 0 0 1 2
#> 351 10975 4.375 1 0 0 1 2
#> 352 10989 3.750 0 0 1 1 2
#> 353 10997 3.750 0 1 1 0 2
#> 354 11038 3.750 1 0 0 1 2
#> 355 11046 4.375 1 0 0 1 2
#> 356 11052 5.000 0 1 1 0 2
#> 357 11074 4.375 0 1 1 0 2
#> 358 11170 4.375 1 0 0 1 2
#> 359 11182 3.750 1 0 0 1 2
#> 360 11199 3.750 1 0 0 1 2
#> 361 11237 3.750 0 1 1 0 2
#> 362 11385 4.375 0 0 1 1 2
#> 363 11387 3.125 1 0 0 1 2
#> 364 11441 3.750 1 0 0 1 2
#> 365 11453 1.250 0 0 1 1 2
#> 366 11538 4.375 1 0 0 1 2
#> 367 11580 4.375 0 0 1 1 2
#> 368 11587 5.000 1 0 0 1 2
#> 369 11613 4.375 1 0 0 1 2
#> 370 11623 5.000 0 1 1 0 2
#> 371 11640 4.375 1 0 0 1 2
#> 372 11694 4.375 0 1 1 0 2
#> 373 11707 3.750 1 0 0 1 2
#> 374 11723 3.750 1 0 0 1 2
#> 375 11727 3.750 1 0 0 1 2
#> 376 11738 3.125 0 1 1 0 2
#> 377 11794 3.750 1 0 0 1 2
#> 378 11823 3.750 1 0 0 1 2
#> 379 11858 3.750 0 1 1 0 2
#> 380 11907 2.500 1 0 0 1 2
#> 381 11952 3.750 1 0 0 1 2
#> 382 12062 4.375 1 0 0 1 2
#> 383 12069 4.375 1 0 1 0 2
#> 384 12179 5.000 0 1 1 0 2
#> 385 12233 3.125 1 0 0 1 2
#> 386 12251 4.375 1 0 0 1 2
#> 387 12321 4.375 0 1 1 0 2
#> 388 12330 3.750 1 0 0 1 2
#> 389 12451 3.750 0 1 1 0 2
#> 390 12458 4.375 1 0 0 1 2
#> 391 12480 3.125 0 1 0 1 2
#> 392 12490 3.750 0 1 1 0 2
#> 393 12505 4.375 1 0 0 1 2
#> 394 12533 5.000 0 1 1 0 2
#> 395 12651 3.750 1 0 0 1 2
#> 396 12717 5.000 1 0 0 1 2
#> 397 12739 3.750 0 1 1 0 2
#> 398 12761 4.375 1 0 0 1 2
#> 399 12797 4.375 1 0 0 1 2
#> 400 12832 4.375 1 0 0 1 2
#> 401 12834 5.000 1 0 0 1 2
#> 402 12882 3.125 1 0 0 1 2
#> 403 12899 4.375 1 0 0 1 2
#> 404 12905 4.375 1 0 0 1 2
#> 405 12974 3.750 1 0 0 1 2
#> 406 12979 4.375 1 0 0 1 2
#> 407 13041 5.000 1 0 0 1 2
#> 408 13054 3.750 1 1 0 0 2
#> 409 13112 3.750 1 0 0 1 2
#> 410 13125 5.000 1 0 0 1 2
#> 411 13147 4.375 1 0 0 1 2
#> 412 13200 3.750 1 0 0 1 2
#> 413 13242 3.750 1 0 0 1 2
#> 414 13270 4.375 1 0 0 1 2
#> 415 13284 3.750 1 0 0 1 2
#> 416 13292 3.750 1 0 0 1 2
#> 417 13331 4.375 1 0 0 1 2
#> 418 13357 4.375 1 0 0 1 2
#> 419 13385 4.375 1 0 0 1 2
#> 420 13387 4.375 0 1 1 0 2
#> 421 13412 4.375 1 0 0 1 2
#> 422 13422 4.375 1 0 1 0 2
#> 423 13424 5.000 0 1 1 0 2
#> 424 13562 3.750 1 0 0 1 2
#> 425 13723 3.750 0 0 1 1 2
#> 426 13846 3.750 1 1 0 0 2
#> 427 13890 5.000 0 1 1 0 2
#> 428 13912 4.375 0 0 1 1 2
#> 429 13916 3.750 0 1 1 0 2
#> 430 13940 4.375 1 0 0 1 2
#> 431 14000 4.375 1 0 0 1 2
#> 432 14006 3.125 1 0 1 0 2
#> 433 14022 4.375 1 0 0 1 2
#> 434 14067 5.000 0 1 0 1 2
#> 435 14078 5.000 0 0 1 1 2
#> 436 14086 4.375 1 0 0 1 2
#> 437 14130 4.375 1 0 0 1 2
#> 438 14175 3.125 0 1 0 1 2
#> 439 14214 4.375 0 1 1 0 2
#> 440 14256 3.750 0 1 0 1 2
#> 441 14261 5.000 1 0 0 1 2
#> 442 14264 4.375 1 0 0 1 2
#> 443 14425 4.375 1 0 0 1 2
#> 444 14443 2.500 1 0 0 1 2
#> 445 14447 4.375 0 1 1 0 2
#> 446 14496 3.750 1 0 0 1 2
#> 447 14535 3.750 0 1 0 1 2
#> 448 14563 5.000 0 1 1 0 2
#> 449 14581 5.000 0 1 1 0 2
#> 450 14582 4.375 1 0 1 0 2
#> 451 14613 4.375 1 0 0 1 2
#> 452 14673 5.000 1 0 0 1 2
#> 453 14718 4.375 1 0 0 1 2
#> 454 14726 3.750 1 0 0 1 2
#> 455 14729 3.750 0 0 1 1 2
#> 456 14750 4.375 1 0 0 1 2
#> 457 14755 4.375 1 0 0 1 2
#> 458 14768 3.750 0 0 1 1 2
#> 459 14774 4.375 0 1 1 0 2
#> 460 14784 3.750 0 1 0 1 2
#> 461 14791 3.125 1 0 0 1 2
#> 462 14835 3.750 0 0 1 1 2
#> 463 14946 4.375 1 0 0 1 2
#> 464 14967 3.750 1 0 0 1 2
#> 465 15026 3.750 1 0 0 1 2
#> 466 15044 4.375 1 0 0 1 2
#> 467 15054 5.000 0 1 1 0 2
#> 468 15072 3.750 1 0 0 1 2
#> 469 15083 2.500 1 0 0 1 2
#> 470 15115 2.500 1 0 0 1 2
#> 471 15134 4.375 1 0 0 1 2
#> 472 15137 4.375 1 0 0 1 2
#> 473 15141 4.375 1 0 0 1 2
#> 474 15161 4.375 0 1 1 0 2
#> 475 15179 3.750 1 0 0 1 2
#> 476 15217 5.000 1 0 0 1 2
#> 477 15289 3.125 1 0 1 0 2
#> 478 15302 4.375 0 1 1 0 2
#> 479 15315 3.750 0 1 1 0 2
#> 480 15355 3.750 1 0 0 1 2
#> 481 15357 2.500 1 0 0 1 2
#> 482 15379 5.000 0 1 1 0 2
#> 483 15480 4.375 0 1 1 0 2
#> 484 15512 4.375 0 1 1 0 2
#> 485 15542 3.125 1 0 0 1 2
#> 486 15613 4.375 1 0 0 1 2
#> 487 15644 4.375 0 1 0 1 2
#> 488 15667 3.750 0 1 1 0 2
#> 489 15707 5.000 1 0 0 1 2
#> 490 15738 4.375 1 0 0 1 2
#> 491 15879 4.375 0 1 1 0 2
#> 492 15959 5.000 0 1 1 0 2
#> 493 16010 4.375 1 1 0 0 2
#> 494 16037 3.750 1 1 0 0 2
#> 495 16040 4.375 0 1 1 0 2
#> 496 16042 3.750 1 0 0 1 2
#> 497 16074 1.875 0 1 1 0 2
#> 498 16100 4.375 0 1 1 0 2
#> 499 16153 4.375 1 0 0 1 2
#> 500 16183 4.375 0 1 0 1 2
#> 501 16212 4.375 1 0 0 1 2
#> 502 16215 3.125 1 0 1 0 2
#> 503 16216 4.375 1 0 0 1 2
#> 504 16222 4.375 1 0 0 1 2
#> 505 16280 4.375 1 0 0 1 2
#> 506 16308 4.375 1 0 0 1 2
#> 507 16334 4.375 1 1 0 0 2
#> 508 16337 3.750 0 1 1 0 2
#> 509 16369 4.375 1 0 0 1 2
#> 510 16394 4.375 0 1 1 0 2
#> 511 16418 4.375 0 0 1 1 2
#> 512 16474 3.750 0 1 1 0 2
#> 513 16486 4.375 1 0 0 1 2
#> 514 16524 3.750 1 0 0 1 2
#> 515 16571 4.375 0 0 1 1 2
#> 516 16596 4.375 1 0 0 1 2
#> 517 16620 5.000 0 1 1 0 2
#> 518 16637 4.375 1 0 0 1 2
#> 519 16639 5.000 1 0 0 1 2
#> 520 16660 4.375 1 0 1 0 2
#> 521 16702 3.750 0 1 1 0 2
#> 522 16748 1.250 0 1 1 0 2
#> 523 16777 3.750 1 0 0 1 2
#> 524 16787 3.750 1 1 0 0 2
#> 525 16919 5.000 0 1 1 0 2
#> 526 16961 3.750 1 0 0 1 2
#> 527 16981 3.125 1 0 0 1 2
#> 528 16993 4.375 1 0 0 1 2
#> 529 16997 3.750 1 0 0 1 2
#> 530 17004 4.375 0 1 1 0 2
#> 531 17051 4.375 1 0 0 1 2
#> 532 17063 4.375 0 1 1 0 2
#> 533 17071 5.000 0 1 1 0 2
#> 534 17124 4.375 0 1 0 1 2
#> 535 17165 4.375 1 0 0 1 2
#> 536 17219 5.000 1 0 0 1 2
#> 537 17234 3.750 1 0 0 1 2
#> 538 17264 3.750 1 0 0 1 2
#> 539 17287 4.375 1 0 0 1 2
#> 540 17289 3.750 1 0 0 1 2
#> 541 17342 3.750 0 1 1 0 2
#> 542 17363 4.375 1 0 0 1 2
#> 543 17380 4.375 1 1 0 0 2
#> 544 17491 3.750 1 0 0 1 2
#> 545 17495 3.750 1 0 0 1 2
#> 546 17499 4.375 1 0 0 1 2
#> 547 17547 2.500 1 0 0 1 2
#> 548 17584 5.000 1 0 0 1 2
#> 549 17600 5.000 1 0 0 1 2
#> 550 17614 4.375 1 0 0 1 2
#> 551 17629 3.750 1 0 0 1 2
#> 552 17645 3.750 0 1 1 0 2
#> 553 17660 4.375 1 0 0 1 2
#> 554 17696 4.375 1 0 0 1 2
#> 555 17697 4.375 1 0 0 1 2
#> 556 17752 4.375 1 0 0 1 2
#> 557 17864 4.375 1 0 0 1 2
#> 558 17886 5.000 1 0 0 1 2
#> 559 17904 3.750 0 1 0 1 2
#> 560 17978 5.000 1 0 0 1 2
#> 561 17997 4.375 1 0 0 1 2
#> 562 18014 3.750 1 0 0 1 2
#> 563 18029 4.375 0 1 1 0 2
#> 564 18085 4.375 1 0 1 0 2
#> 565 18143 4.375 1 0 0 1 2
#> 566 18153 2.500 1 0 0 1 2
#> 567 18213 5.000 1 0 1 0 2
#> 568 18236 2.500 0 0 1 1 2
#> 569 18251 3.750 0 1 1 0 2
#> 570 18268 4.375 1 0 0 1 2
#> 571 18325 3.750 1 0 0 1 2
#> 572 18327 4.375 1 0 0 1 2
#> 573 18446 4.375 1 0 1 0 2
#> 574 18530 3.750 1 0 0 1 2
#> 575 18602 4.375 0 1 1 0 2
#> 576 18825 4.375 1 0 0 1 2
#> 577 18832 3.750 1 0 0 1 2
#> 578 18837 5.000 1 0 0 1 2
#> 579 18852 4.375 1 0 0 1 2
#> 580 18885 4.375 1 0 0 1 2
#> 581 18982 3.750 0 1 1 0 2
#> 582 19108 3.125 1 0 0 1 2
#> 583 19117 3.125 1 0 0 1 2
#> 584 19205 5.000 1 0 0 1 2
#> 585 19273 4.375 1 0 0 1 2
#> 586 19329 4.375 0 1 1 0 2
#> 587 19338 3.750 1 0 0 1 2
#> 588 19369 4.375 0 1 1 0 2
#> 589 19399 3.750 1 0 0 1 2
#> 590 19493 3.750 0 1 1 0 2
#> 591 19539 4.375 0 1 1 0 2
#> 592 19547 3.750 1 0 0 1 2
#> 593 19563 4.375 1 0 0 1 2
#> 594 19615 4.375 1 0 0 1 2
#> 595 19625 4.375 1 0 0 1 2
#> 596 19627 3.750 1 0 0 1 2
#> 597 19643 3.125 0 0 1 1 2
#> 598 19685 4.375 1 0 0 1 2
#> 599 19792 3.750 0 1 1 0 2
#> 600 19802 4.375 1 0 0 1 2
#> 601 19815 4.375 1 0 0 1 2
#> 602 19881 4.375 0 1 1 0 2
#> 603 19889 4.375 1 0 0 1 2
#> 604 19905 5.000 0 1 1 0 2
#> 605 20041 4.375 1 0 0 1 2
#> 606 20049 4.375 0 1 1 0 2
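Whether two tagged seasons are adjacent can be computed rather than read off the listing. The sketch below is one possible way to do it in base R, on invented toy rows laid out like `seasons_df`. The helper `is_adjacent_pair()` and the `adjacent` column are hypothetical names introduced here, and treating fall and winter as adjacent is an assumption.

```r
# Toy rows in the same layout as seasons_df (values invented for illustration).
toy_seasons <- data.frame(
  ID     = 1:4,
  winter = c(1, 0, 1, 0),
  spring = c(0, 1, 0, 1),
  summer = c(0, 1, 1, 0),
  fall   = c(1, 0, 0, 1)
)

# Seasons in calendar order; winter-spring, spring-summer, summer-fall
# and fall-winter are treated as adjacent pairs (an assumption).
season_order <- c("winter", "spring", "summer", "fall")

# Hypothetical helper: TRUE when the two flagged seasons are neighbours.
is_adjacent_pair <- function(flags) {
  idx <- which(flags == 1)   # positions of the two tagged seasons
  gap <- diff(idx)           # 1, 2 or 3 for two distinct positions
  gap %in% c(1, 3)           # a gap of 3 is fall-winter wrapping around
}

toy_seasons$adjacent <- apply(toy_seasons[, season_order], 1, is_adjacent_pair)
print(toy_seasons)
```

On the real data the helper would be applied only to rows with `sum_season == 2`. In the listing above, most pairs appear to be winter with fall or spring with summer, both of which count as adjacent under this definition.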
### should we keep those observations???
As we can see, there is again no correlation between seasons and recipe ratings.
seasons_df %>%
select(-sum_season, -ID) %>%
plot_correlation()
We can try to feature-engineer binary indicators from the rating, such as a rating above 4.
The result is, again, inconclusive.
seasons_df <- seasons_df %>%
mutate(rating_above_4 = ifelse(rating > 4, 1, 0),
rating_5 = ifelse(rating == 5, 1, 0),
rating_1.25 = ifelse(rating == 1.25, 1, 0))
seasons_df %>%
select(rating, rating_above_4, all_of(seasons_vec))%>%
plot_correlation()
seasons_df %>%
select(rating_5, rating_1.25, all_of(seasons_vec)) %>%
plot_correlation()
4.5.2 Recipe Type
#TO ADD WHEN COMPLETED
#taking into account only the "main" recipe types
recipe_types_select <- c("breakfast", "brunch", "dessert", "dinner", "lunch")
type_df <- recipes %>%
select(ID, rating, all_of(recipe_types_select)) %>%
filter(if_any(all_of(recipe_types_select), ~ .x == 1)) %>%
mutate(sum_type= rowSums(across(all_of(recipe_types_select))))
type_df %>%
ggplot(aes(x=sum_type)) +
geom_bar()
type_df %>%
filter(sum_type == 2)
#> ID rating breakfast brunch dessert dinner lunch sum_type
#> 1 27 3.750 0 0 0 1 1 2
#> 2 50 4.375 1 1 0 0 0 2
#> 3 52 3.750 1 1 0 0 0 2
#> 4 58 4.375 1 1 0 0 0 2
#> 5 67 3.125 0 0 0 1 1 2
#> 6 241 4.375 0 0 0 1 1 2
#> 7 304 3.750 0 0 0 1 1 2
#> 8 361 4.375 0 1 0 0 1 2
#> 9 367 4.375 0 1 0 0 1 2
#> 10 389 4.375 0 0 0 1 1 2
#> 11 429 5.000 1 1 0 0 0 2
#> 12 433 4.375 0 0 0 1 1 2
#> 13 449 3.750 0 0 0 1 1 2
#> 14 467 4.375 1 1 0 0 0 2
#> 15 476 4.375 0 1 1 0 0 2
#> 16 501 4.375 0 1 1 0 0 2
#> 17 502 3.750 0 0 0 1 1 2
#> 18 513 5.000 0 0 0 1 1 2
#> 19 514 5.000 0 0 0 1 1 2
#> 20 533 4.375 1 1 0 0 0 2
#> 21 534 3.750 1 1 0 0 0 2
#> 22 558 4.375 0 0 0 1 1 2
#> 23 574 4.375 1 0 1 0 0 2
#> 24 593 3.125 1 1 0 0 0 2
#> 25 610 4.375 0 0 0 1 1 2
#> 26 645 4.375 1 1 0 0 0 2
#> 27 648 3.750 1 1 0 0 0 2
#> 28 694 4.375 1 1 0 0 0 2
#> 29 728 3.125 1 1 0 0 0 2
#> 30 730 4.375 0 0 0 1 1 2
#> 31 813 3.750 0 1 1 0 0 2
#> 32 890 3.125 0 0 0 1 1 2
#> 33 909 3.750 1 1 0 0 0 2
#> 34 924 3.750 0 0 0 1 1 2
#> 35 1020 3.125 0 0 0 1 1 2
#> 36 1035 4.375 0 1 0 0 1 2
#> 37 1038 3.125 0 0 0 1 1 2
#> 38 1057 3.750 0 0 0 1 1 2
#> 39 1129 4.375 0 0 0 1 1 2
#> 40 1179 3.125 1 0 1 0 0 2
#> 41 1185 4.375 1 1 0 0 0 2
#> 42 1207 4.375 0 0 0 1 1 2
#> 43 1245 4.375 0 0 0 1 1 2
#> 44 1262 4.375 0 0 0 1 1 2
#> 45 1285 4.375 0 0 0 1 1 2
#> 46 1315 3.125 0 0 0 1 1 2
#> 47 1329 3.750 1 1 0 0 0 2
#> 48 1346 3.125 1 1 0 0 0 2
#> 49 1387 3.750 0 0 0 1 1 2
#> 50 1426 4.375 1 1 0 0 0 2
#> 51 1453 3.750 1 1 0 0 0 2
#> 52 1456 5.000 1 1 0 0 0 2
#> 53 1511 4.375 0 0 0 1 1 2
#> 54 1525 3.750 0 0 0 1 1 2
#> 55 1528 4.375 0 0 0 1 1 2
#> 56 1537 5.000 0 0 0 1 1 2
#> 57 1540 3.750 0 0 0 1 1 2
#> 58 1613 4.375 0 0 0 1 1 2
#> 59 1625 4.375 0 1 1 0 0 2
#> 60 1648 3.750 0 0 0 1 1 2
#> 61 1659 5.000 1 1 0 0 0 2
#> 62 1725 4.375 1 1 0 0 0 2
#> 63 1782 4.375 0 0 0 1 1 2
#> 64 1803 4.375 0 0 0 1 1 2
#> 65 1829 3.125 0 0 0 1 1 2
#> 66 1874 5.000 0 0 0 1 1 2
#> 67 1933 3.750 1 1 0 0 0 2
#> 68 1940 4.375 0 0 0 1 1 2
#> 69 1965 3.750 1 1 0 0 0 2
#> 70 2035 5.000 0 0 0 1 1 2
#> 71 2041 3.750 1 1 0 0 0 2
#> 72 2092 4.375 0 0 0 1 1 2
#> 73 2139 3.750 1 1 0 0 0 2
#> 74 2184 1.875 1 1 0 0 0 2
#> 75 2413 4.375 1 1 0 0 0 2
#> 76 2421 4.375 0 0 0 1 1 2
#> 77 2478 3.125 0 0 0 1 1 2
#> 78 2489 3.750 1 1 0 0 0 2
#> 79 2518 3.750 0 0 0 1 1 2
#> 80 2528 2.500 1 1 0 0 0 2
#> 81 2540 2.500 1 1 0 0 0 2
#> 82 2604 4.375 1 1 0 0 0 2
#> 83 2634 4.375 1 1 0 0 0 2
#> 84 2710 3.750 1 1 0 0 0 2
#> 85 2738 3.750 0 0 0 1 1 2
#> 86 2745 3.125 0 1 1 0 0 2
#> 87 2769 4.375 0 0 0 1 1 2
#> 88 2799 4.375 1 1 0 0 0 2
#> 89 2812 4.375 1 1 0 0 0 2
#> 90 2847 4.375 0 1 0 0 1 2
#> 91 2865 4.375 1 1 0 0 0 2
#> 92 2884 4.375 0 0 0 1 1 2
#> 93 2895 4.375 1 1 0 0 0 2
#> 94 2904 5.000 0 0 0 1 1 2
#> 95 2912 3.125 1 1 0 0 0 2
#> 96 3001 4.375 1 1 0 0 0 2
#> 97 3025 4.375 0 0 0 1 1 2
#> 98 3097 5.000 1 1 0 0 0 2
#> 99 3157 4.375 0 1 1 0 0 2
#> 100 3172 3.750 0 0 0 1 1 2
#> 101 3234 4.375 0 0 0 1 1 2
#> 102 3245 4.375 1 1 0 0 0 2
#> 103 3342 4.375 0 1 0 0 1 2
#> 104 3354 4.375 1 1 0 0 0 2
#> 105 3405 4.375 0 0 0 1 1 2
#> 106 3423 4.375 1 1 0 0 0 2
#> 107 3492 4.375 0 0 0 1 1 2
#> 108 3524 5.000 0 0 0 1 1 2
#> 109 3572 5.000 0 0 0 1 1 2
#> 110 3670 4.375 1 1 0 0 0 2
#> 111 3734 4.375 0 0 0 1 1 2
#> 112 3806 4.375 0 0 0 1 1 2
#> 113 3811 3.750 1 1 0 0 0 2
#> 114 3855 3.750 1 1 0 0 0 2
#> 115 3926 4.375 1 1 0 0 0 2
#> 116 3983 4.375 1 1 0 0 0 2
#> 117 3985 4.375 0 0 0 1 1 2
#> 118 3997 3.750 0 0 0 1 1 2
#> 119 4014 4.375 0 0 0 1 1 2
#> 120 4038 4.375 1 1 0 0 0 2
#> 121 4092 4.375 1 1 0 0 0 2
#> 122 4096 4.375 0 0 0 1 1 2
#> 123 4097 3.750 0 0 0 1 1 2
#> 124 4101 1.250 1 0 1 0 0 2
#> 125 4122 4.375 0 1 0 0 1 2
#> 126 4135 5.000 1 1 0 0 0 2
#> 127 4204 4.375 0 0 0 1 1 2
#> 128 4207 3.750 0 0 0 1 1 2
#> 129 4227 5.000 1 1 0 0 0 2
#> 130 4236 4.375 0 0 0 1 1 2
#> 131 4274 4.375 1 1 0 0 0 2
#> 132 4289 2.500 1 0 1 0 0 2
#> 133 4337 3.750 1 1 0 0 0 2
#> 134 4489 2.500 1 1 0 0 0 2
#> 135 4498 4.375 1 1 0 0 0 2
#> 136 4537 4.375 1 1 0 0 0 2
#> 137 4547 3.750 1 1 0 0 0 2
#> 138 4559 2.500 1 1 0 0 0 2
#> 139 4668 3.125 0 1 1 0 0 2
#> 140 4682 4.375 0 0 0 1 1 2
#> 141 4744 2.500 1 1 0 0 0 2
#> 142 4775 4.375 1 1 0 0 0 2
#> 143 4779 3.125 1 1 0 0 0 2
#> 144 4788 4.375 1 1 0 0 0 2
#> 145 4796 5.000 0 1 1 0 0 2
#> 146 4803 3.750 0 0 0 1 1 2
#> 147 4825 5.000 0 0 0 1 1 2
#> 148 4833 3.750 0 0 0 1 1 2
#> 149 4834 3.750 0 0 0 1 1 2
#> 150 4854 4.375 0 0 0 1 1 2
#> 151 4902 3.750 0 1 1 0 0 2
#> 152 4903 5.000 1 1 0 0 0 2
#> 153 4919 3.125 0 0 0 1 1 2
#> 154 4942 4.375 1 1 0 0 0 2
#> 155 4980 3.750 1 1 0 0 0 2
#> 156 4987 3.750 0 0 0 1 1 2
#> 157 5048 5.000 0 0 0 1 1 2
#> 158 5057 1.875 0 0 0 1 1 2
#> 159 5207 4.375 1 1 0 0 0 2
#> 160 5224 4.375 0 0 0 1 1 2
#> 161 5238 4.375 0 0 0 1 1 2
#> 162 5304 4.375 0 0 0 1 1 2
#> 163 5314 5.000 1 1 0 0 0 2
#> 164 5332 4.375 0 0 0 1 1 2
#> 165 5355 3.750 1 1 0 0 0 2
#> 166 5413 4.375 0 1 1 0 0 2
#> 167 5448 4.375 1 1 0 0 0 2
#> 168 5505 3.750 1 1 0 0 0 2
#> 169 5524 4.375 1 1 0 0 0 2
#> 170 5537 2.500 0 0 0 1 1 2
#> 171 5581 4.375 1 1 0 0 0 2
#> 172 5642 5.000 0 0 0 1 1 2
#> 173 5694 3.750 1 1 0 0 0 2
#> 174 5713 3.750 1 1 0 0 0 2
#> 175 5818 3.750 1 1 0 0 0 2
#> 176 5833 3.125 1 1 0 0 0 2
#> 177 5876 4.375 0 0 0 1 1 2
#> 178 5880 5.000 0 1 0 0 1 2
#> 179 5969 4.375 0 1 0 1 0 2
#> 180 6077 2.500 0 0 0 1 1 2
#> 181 6080 3.750 1 0 1 0 0 2
#> 182 6164 3.750 0 1 1 0 0 2
#> 183 6169 3.750 1 1 0 0 0 2
#> 184 6234 5.000 0 1 1 0 0 2
#> 185 6269 4.375 0 0 0 1 1 2
#> 186 6308 5.000 0 1 0 0 1 2
#> 187 6373 3.125 0 0 0 1 1 2
#> 188 6386 4.375 0 1 1 0 0 2
#> 189 6407 3.750 0 0 0 1 1 2
#> 190 6438 4.375 0 0 0 1 1 2
#> 191 6466 4.375 1 1 0 0 0 2
#> 192 6487 4.375 0 0 0 1 1 2
#> 193 6529 3.750 1 1 0 0 0 2
#> 194 6550 4.375 0 1 0 0 1 2
#> 195 6578 3.750 0 0 0 1 1 2
#> 196 6621 4.375 0 0 0 1 1 2
#> 197 6759 4.375 0 0 0 1 1 2
#> 198 6769 4.375 0 0 0 1 1 2
#> 199 6780 4.375 0 0 0 1 1 2
#> 200 6816 3.750 1 1 0 0 0 2
#> 201 6830 4.375 1 1 0 0 0 2
#> 202 6839 4.375 0 0 0 1 1 2
#> 203 6847 4.375 0 0 0 1 1 2
#> 204 6857 4.375 1 1 0 0 0 2
#> 205 6884 3.750 0 0 0 1 1 2
#> 206 6902 4.375 1 1 0 0 0 2
#> 207 6904 5.000 1 1 0 0 0 2
#> 208 6927 5.000 0 0 0 1 1 2
#> 209 6931 3.750 0 0 0 1 1 2
#> 210 6960 5.000 0 0 0 1 1 2
#> 211 7009 4.375 0 0 0 1 1 2
#> 212 7010 5.000 0 0 0 1 1 2
#> 213 7028 4.375 1 1 0 0 0 2
#> 214 7087 4.375 1 1 0 0 0 2
#> 215 7126 4.375 0 1 0 0 1 2
#> 216 7264 4.375 1 0 1 0 0 2
#> 217 7300 4.375 0 0 0 1 1 2
#> 218 7308 4.375 1 1 0 0 0 2
#> 219 7353 3.750 0 0 0 1 1 2
#> 220 7364 4.375 0 0 0 1 1 2
#> 221 7369 4.375 0 0 1 1 0 2
#> 222 7404 4.375 0 1 1 0 0 2
#> 223 7444 3.750 1 1 0 0 0 2
#> 224 7463 3.750 1 1 0 0 0 2
#> 225 7483 5.000 1 1 0 0 0 2
#> 226 7525 4.375 1 1 0 0 0 2
#> 227 7565 3.750 0 0 0 1 1 2
#> 228 7586 4.375 1 0 0 0 1 2
#> 229 7612 5.000 0 0 0 1 1 2
#> 230 7708 5.000 1 1 0 0 0 2
#> 231 7743 5.000 0 0 0 1 1 2
#> 232 7783 4.375 0 0 0 1 1 2
#> 233 7836 4.375 1 1 0 0 0 2
#> 234 7858 4.375 1 1 0 0 0 2
#> 235 7869 3.750 1 1 0 0 0 2
#> 236 7983 4.375 0 1 1 0 0 2
#> 237 8093 4.375 0 0 0 1 1 2
#> 238 8102 4.375 0 1 1 0 0 2
#> 239 8135 5.000 1 1 0 0 0 2
#> 240 8146 3.750 0 0 0 1 1 2
#> 241 8157 4.375 1 1 0 0 0 2
#> 242 8158 4.375 0 1 1 0 0 2
#> 243 8227 5.000 1 1 0 0 0 2
#> 244 8253 4.375 0 0 0 1 1 2
#> 245 8270 4.375 0 0 0 1 1 2
#> 246 8314 3.750 0 0 0 1 1 2
#> 247 8321 4.375 0 0 0 1 1 2
#> 248 8372 5.000 1 1 0 0 0 2
#> 249 8429 3.125 0 0 0 1 1 2
#> 250 8459 3.750 0 0 0 1 1 2
#> 251 8476 4.375 1 1 0 0 0 2
#> 252 8521 4.375 0 0 0 1 1 2
#> 253 8524 3.750 0 0 0 1 1 2
#> 254 8609 5.000 0 0 0 1 1 2
#> 255 8628 3.750 1 1 0 0 0 2
#> 256 8641 3.750 0 0 0 1 1 2
#> 257 8677 4.375 1 1 0 0 0 2
#> 258 8738 4.375 1 1 0 0 0 2
#> 259 8745 3.125 1 1 0 0 0 2
#> 260 8747 4.375 0 0 0 1 1 2
#> 261 8755 4.375 0 0 0 1 1 2
#> 262 8759 4.375 1 1 0 0 0 2
#> 263 8778 4.375 0 0 0 1 1 2
#> 264 8818 3.750 1 1 0 0 0 2
#> 265 8820 5.000 0 0 0 1 1 2
#> 266 8821 3.750 1 1 0 0 0 2
#> 267 8830 4.375 0 0 0 1 1 2
#> 268 8883 4.375 0 0 0 1 1 2
#> 269 8916 4.375 0 1 1 0 0 2
#> 270 8923 4.375 1 1 0 0 0 2
#> 271 8936 3.750 0 0 0 1 1 2
#> 272 8954 4.375 1 0 1 0 0 2
#> 273 8987 4.375 0 0 0 1 1 2
#> 274 8989 4.375 0 0 0 1 1 2
#> 275 8999 2.500 0 0 0 1 1 2
#> 276 9046 5.000 0 0 0 1 1 2
#> 277 9066 5.000 0 0 1 1 0 2
#> 278 9070 3.750 0 1 0 0 1 2
#> 279 9122 4.375 1 1 0 0 0 2
#> 280 9141 4.375 1 1 0 0 0 2
#> 281 9169 5.000 0 0 0 1 1 2
#> 282 9173 4.375 1 1 0 0 0 2
#> 283 9269 4.375 0 1 1 0 0 2
#> 284 9358 5.000 1 0 1 0 0 2
#> 285 9365 3.750 0 0 0 1 1 2
#> 286 9367 4.375 1 1 0 0 0 2
#> 287 9396 3.750 0 0 0 1 1 2
#> 288 9405 3.750 1 1 0 0 0 2
#> 289 9420 4.375 0 1 1 0 0 2
#> 290 9428 4.375 1 1 0 0 0 2
#> 291 9554 4.375 0 1 0 1 0 2
#> 292 9557 3.750 1 1 0 0 0 2
#> 293 9598 4.375 0 0 0 1 1 2
#> 294 9628 5.000 0 0 0 1 1 2
#> 295 9643 2.500 0 1 0 0 1 2
#> 296 9699 4.375 0 0 0 1 1 2
#> 297 9816 3.125 1 1 0 0 0 2
#> 298 9844 4.375 1 1 0 0 0 2
#> 299 9917 4.375 0 0 0 1 1 2
#> 300 9955 3.750 0 1 1 0 0 2
#> 301 9957 5.000 1 1 0 0 0 2
#> 302 9961 4.375 0 0 0 1 1 2
#> 303 10009 3.125 0 0 0 1 1 2
#> 304 10019 3.750 0 0 0 1 1 2
#> 305 10067 3.750 0 0 0 1 1 2
#> 306 10088 4.375 0 1 1 0 0 2
#> 307 10105 4.375 1 1 0 0 0 2
#> 308 10152 4.375 1 1 0 0 0 2
#> 309 10239 3.750 0 0 0 1 1 2
#> 310 10363 2.500 1 1 0 0 0 2
#> 311 10373 3.750 1 1 0 0 0 2
#> 312 10393 3.125 0 0 0 1 1 2
#> 313 10430 3.750 1 1 0 0 0 2
#> 314 10564 4.375 0 0 0 1 1 2
#> 315 10598 5.000 1 1 0 0 0 2
#> 316 10690 5.000 1 1 0 0 0 2
#> 317 10748 4.375 0 1 1 0 0 2
#> 318 10763 2.500 0 0 0 1 1 2
#> 319 10770 4.375 1 1 0 0 0 2
#> 320 10800 3.750 1 1 0 0 0 2
#> 321 10907 4.375 1 1 0 0 0 2
#> 322 10926 4.375 1 1 0 0 0 2
#> 323 10928 5.000 0 0 0 1 1 2
#> 324 10959 4.375 1 1 0 0 0 2
#> 325 10967 3.750 1 1 0 0 0 2
#> 326 11010 5.000 1 0 1 0 0 2
#> 327 11046 4.375 0 0 0 1 1 2
#> 328 11048 3.750 0 1 0 0 1 2
#> 329 11097 5.000 0 1 1 0 0 2
#> 330 11101 1.250 0 0 0 1 1 2
#> 331 11124 4.375 1 1 0 0 0 2
#> 332 11135 2.500 0 1 1 0 0 2
#> 333 11157 5.000 0 0 0 1 1 2
#> 334 11169 5.000 0 0 0 1 1 2
#> 335 11177 3.750 0 0 0 1 1 2
#> 336 11215 3.750 1 1 0 0 0 2
#> 337 11242 3.750 1 1 0 0 0 2
#> 338 11249 4.375 1 1 0 0 0 2
#> 339 11260 4.375 1 1 0 0 0 2
#> 340 11295 4.375 0 0 0 1 1 2
#> 341 11315 3.750 0 1 1 0 0 2
#> 342 11324 4.375 0 0 0 1 1 2
#> 343 11396 3.750 1 1 0 0 0 2
#> 344 11478 4.375 1 1 0 0 0 2
#> 345 11496 3.750 0 0 0 1 1 2
#> 346 11535 4.375 1 1 0 0 0 2
#> 347 11660 5.000 0 0 0 1 1 2
#> 348 11679 4.375 0 0 0 1 1 2
#> 349 11701 5.000 0 0 0 1 1 2
#> 350 11714 4.375 1 0 1 0 0 2
#> 351 11753 4.375 0 1 0 1 0 2
#> 352 11762 5.000 0 0 0 1 1 2
#> 353 11768 5.000 0 0 0 1 1 2
#> 354 11784 4.375 0 0 0 1 1 2
#> 355 11801 5.000 0 0 0 1 1 2
#> 356 11840 4.375 0 0 0 1 1 2
#> 357 11847 3.750 1 1 0 0 0 2
#> 358 11952 3.750 1 0 1 0 0 2
#> 359 11985 4.375 0 0 0 1 1 2
#> 360 12074 4.375 1 1 0 0 0 2
#> 361 12100 3.750 0 1 1 0 0 2
#> 362 12116 4.375 0 1 1 0 0 2
#> 363 12177 4.375 0 0 0 1 1 2
#> 364 12180 3.750 1 1 0 0 0 2
#> 365 12202 5.000 1 1 0 0 0 2
#> 366 12224 5.000 0 0 0 1 1 2
#> 367 12239 3.750 0 0 0 1 1 2
#> 368 12342 3.750 0 0 0 1 1 2
#> 369 12350 4.375 1 1 0 0 0 2
#> 370 12455 5.000 0 0 0 1 1 2
#> 371 12469 3.125 1 0 1 0 0 2
#> 372 12533 5.000 0 0 1 1 0 2
#> 373 12539 3.750 1 1 0 0 0 2
#> 374 12554 4.375 1 1 0 0 0 2
#> 375 12556 4.375 0 0 0 1 1 2
#> 376 12580 3.750 1 1 0 0 0 2
#> 377 12582 5.000 0 0 0 1 1 2
#> 378 12661 4.375 0 0 0 1 1 2
#> 379 12666 4.375 0 1 1 0 0 2
#> 380 12771 5.000 0 0 0 1 1 2
#> 381 12773 3.750 1 1 0 0 0 2
#> 382 12847 5.000 1 1 0 0 0 2
#> 383 12869 4.375 1 1 0 0 0 2
#> 384 12919 3.125 0 0 0 1 1 2
#> 385 12946 4.375 1 1 0 0 0 2
#> 386 12962 4.375 0 0 0 1 1 2
#> 387 12984 3.750 0 0 0 1 1 2
#> 388 13000 4.375 1 1 0 0 0 2
#> 389 13054 3.750 0 0 0 1 1 2
#> 390 13086 3.750 1 1 0 0 0 2
#> 391 13137 4.375 1 0 1 0 0 2
#> 392 13139 4.375 0 0 0 1 1 2
#> 393 13212 4.375 0 0 0 1 1 2
#> 394 13214 3.125 1 1 0 0 0 2
#> 395 13252 4.375 0 0 0 1 1 2
#> 396 13261 4.375 0 1 0 0 1 2
#> 397 13273 5.000 1 1 0 0 0 2
#> 398 13284 3.750 0 0 0 1 1 2
#> 399 13287 1.875 1 1 0 0 0 2
#> 400 13304 4.375 0 1 0 0 1 2
#> 401 13335 4.375 1 0 1 0 0 2
#> 402 13339 3.750 0 0 0 1 1 2
#> 403 13357 4.375 0 0 0 1 1 2
#> 404 13377 3.750 0 0 0 1 1 2
#> 405 13412 4.375 0 0 0 1 1 2
#> 406 13417 2.500 1 1 0 0 0 2
#> 407 13428 3.750 1 1 0 0 0 2
#> 408 13458 3.750 0 0 0 1 1 2
#> 409 13553 5.000 1 1 0 0 0 2
#> 410 13636 4.375 0 0 0 1 1 2
#> 411 13759 4.375 1 1 0 0 0 2
#> 412 13770 4.375 1 1 0 0 0 2
#> 413 13826 4.375 1 1 0 0 0 2
#> 414 13916 3.750 0 0 0 1 1 2
#> 415 13947 4.375 1 1 0 0 0 2
#> 416 14020 4.375 1 1 0 0 0 2
#> 417 14021 4.375 0 0 0 1 1 2
#> 418 14031 4.375 1 1 0 0 0 2
#> 419 14095 3.750 1 1 0 0 0 2
#> 420 14164 3.750 1 1 0 0 0 2
#> 421 14196 4.375 0 0 0 1 1 2
#> 422 14210 2.500 0 0 0 1 1 2
#> 423 14213 4.375 1 0 1 0 0 2
#> 424 14219 3.125 0 0 0 1 1 2
#> 425 14284 4.375 0 0 0 1 1 2
#> 426 14298 3.125 0 1 0 0 1 2
#> 427 14332 5.000 1 1 0 0 0 2
#> 428 14336 3.750 0 0 0 1 1 2
#> 429 14362 4.375 1 1 0 0 0 2
#> 430 14457 3.750 0 0 0 1 1 2
#> 431 14506 5.000 1 1 0 0 0 2
#> 432 14536 3.750 1 1 0 0 0 2
#> 433 14581 5.000 0 0 0 1 1 2
#> 434 14639 3.750 1 1 0 0 0 2
#> 435 14696 5.000 1 1 0 0 0 2
#> 436 14729 3.750 0 0 0 1 1 2
#> 437 14739 3.750 1 1 0 0 0 2
#> 438 14774 4.375 0 1 1 0 0 2
#> 439 14832 3.750 1 0 1 0 0 2
#> 440 14845 3.125 1 1 0 0 0 2
#> 441 14950 3.750 1 1 0 0 0 2
#> 442 15014 3.750 0 1 0 0 1 2
#> 443 15053 3.750 0 0 0 1 1 2
#> 444 15078 5.000 1 0 1 0 0 2
#> 445 15115 2.500 0 0 0 1 1 2
#> 446 15185 3.125 1 1 0 0 0 2
#> 447 15188 4.375 1 1 0 0 0 2
#> 448 15194 4.375 1 1 0 0 0 2
#> 449 15206 4.375 1 1 0 0 0 2
#> 450 15266 4.375 1 1 0 0 0 2
#> 451 15276 5.000 0 0 0 1 1 2
#> 452 15289 3.125 0 1 1 0 0 2
#> 453 15302 4.375 0 0 0 1 1 2
#> 454 15303 4.375 0 0 0 1 1 2
#> 455 15307 4.375 0 0 0 1 1 2
#> 456 15309 5.000 0 1 1 0 0 2
#> 457 15388 5.000 1 1 0 0 0 2
#> 458 15397 4.375 1 1 0 0 0 2
#> 459 15410 4.375 0 0 0 1 1 2
#> 460 15430 4.375 1 1 0 0 0 2
#> 461 15518 4.375 1 1 0 0 0 2
#> 462 15683 3.750 0 1 1 0 0 2
#> 463 15703 3.125 1 1 0 0 0 2
#> 464 15707 5.000 0 0 0 1 1 2
#> 465 15738 4.375 1 1 0 0 0 2
#> 466 15741 4.375 1 1 0 0 0 2
#> 467 15768 5.000 0 1 1 0 0 2
#> 468 15793 3.750 0 0 0 1 1 2
#> 469 15836 3.750 0 0 0 1 1 2
#> 470 15886 4.375 1 1 0 0 0 2
#> 471 15915 1.250 0 0 0 1 1 2
#> 472 15943 3.750 0 0 0 1 1 2
#> 473 15945 3.750 1 1 0 0 0 2
#> 474 15986 3.750 1 1 0 0 0 2
#> 475 16086 4.375 0 0 0 1 1 2
#> 476 16103 4.375 0 0 0 1 1 2
#> 477 16288 4.375 0 0 0 1 1 2
#> 478 16322 3.750 0 0 0 1 1 2
#> 479 16345 4.375 0 1 1 0 0 2
#> 480 16427 4.375 0 0 0 1 1 2
#> 481 16549 4.375 1 1 0 0 0 2
#> 482 16622 3.750 0 0 0 1 1 2
#> 483 16667 4.375 0 0 0 1 1 2
#> 484 16668 4.375 0 0 0 1 1 2
#> 485 16696 4.375 0 0 0 1 1 2
#> 486 16794 4.375 1 1 0 0 0 2
#> 487 16796 4.375 1 1 0 0 0 2
#> 488 16849 3.750 1 1 0 0 0 2
#> 489 16867 4.375 0 0 0 1 1 2
#> 490 16872 4.375 0 1 0 0 1 2
#> 491 16903 3.750 0 0 0 1 1 2
#> 492 16908 5.000 0 0 0 1 1 2
#> 493 16941 3.750 0 1 0 0 1 2
#> 494 16966 4.375 0 0 0 1 1 2
#> 495 17007 4.375 1 1 0 0 0 2
#> 496 17031 3.125 1 1 0 0 0 2
#> 497 17067 3.750 1 1 0 0 0 2
#> 498 17105 5.000 0 0 1 0 1 2
#> 499 17137 4.375 0 0 0 1 1 2
#> 500 17201 4.375 0 1 0 0 1 2
#> 501 17215 5.000 1 1 0 0 0 2
#> 502 17271 4.375 1 1 0 0 0 2
#> 503 17314 5.000 0 0 0 1 1 2
#> 504 17397 4.375 1 1 0 0 0 2
#> 505 17455 4.375 1 1 0 0 0 2
#> 506 17495 3.750 0 0 0 1 1 2
#> 507 17507 3.750 1 1 0 0 0 2
#> 508 17550 3.125 0 0 0 1 1 2
#> 509 17555 3.750 1 0 1 0 0 2
#> 510 17563 4.375 0 0 0 1 1 2
#> 511 17600 5.000 1 1 0 0 0 2
#> 512 17614 4.375 0 0 0 1 1 2
#> 513 17666 3.750 1 1 0 0 0 2
#> 514 17715 3.750 0 0 0 1 1 2
#> 515 17742 3.125 1 0 1 0 0 2
#> 516 17764 3.750 0 0 0 1 1 2
#> 517 17793 3.750 1 1 0 0 0 2
#> 518 17814 3.125 1 1 0 0 0 2
#> 519 17823 3.750 1 1 0 0 0 2
#> 520 17833 5.000 1 1 0 0 0 2
#> 521 17913 4.375 0 0 0 1 1 2
#> 522 18063 4.375 0 0 0 1 1 2
#> 523 18078 3.125 0 1 0 0 1 2
#> 524 18116 4.375 0 0 0 1 1 2
#> 525 18121 3.750 0 0 0 1 1 2
#> 526 18140 3.750 1 1 0 0 0 2
#> 527 18141 1.250 1 1 0 0 0 2
#> 528 18173 3.750 1 1 0 0 0 2
#> 529 18179 4.375 0 0 0 1 1 2
#> 530 18238 3.750 1 1 0 0 0 2
#> 531 18242 4.375 0 0 0 1 1 2
#> 532 18272 3.750 1 1 0 0 0 2
#> 533 18349 4.375 0 1 1 0 0 2
#> 534 18357 3.750 1 1 0 0 0 2
#> 535 18408 2.500 0 0 0 1 1 2
#> 536 18440 4.375 0 1 0 0 1 2
#> 537 18464 4.375 0 0 0 1 1 2
#> 538 18548 4.375 0 0 0 1 1 2
#> 539 18579 3.750 1 1 0 0 0 2
#> 540 18587 3.750 1 1 0 0 0 2
#> 541 18595 4.375 0 0 0 1 1 2
#> 542 18612 4.375 0 0 0 1 1 2
#> 543 18706 5.000 1 1 0 0 0 2
#> 544 18756 5.000 0 0 0 1 1 2
#> 545 18786 3.750 1 0 1 0 0 2
#> 546 18792 4.375 1 1 0 0 0 2
#> 547 18885 4.375 0 0 0 1 1 2
#> 548 18898 4.375 0 0 0 1 1 2
#> 549 18948 4.375 0 0 0 1 1 2
#> 550 18959 3.750 1 1 0 0 0 2
#> 551 18983 4.375 0 0 0 1 1 2
#> 552 19029 2.500 1 1 0 0 0 2
#> 553 19046 3.750 0 0 0 1 1 2
#> 554 19118 4.375 0 1 1 0 0 2
#> 555 19160 5.000 0 1 1 0 0 2
#> 556 19208 4.375 0 0 0 1 1 2
#> 557 19278 4.375 0 0 0 1 1 2
#> 558 19310 3.750 0 0 0 1 1 2
#> 559 19345 4.375 0 0 0 1 1 2
#> 560 19357 3.750 1 1 0 0 0 2
#> 561 19371 3.125 0 1 1 0 0 2
#> 562 19400 5.000 1 1 0 0 0 2
#> 563 19423 3.750 0 0 0 1 1 2
#> 564 19429 3.125 1 1 0 0 0 2
#> 565 19469 4.375 1 1 0 0 0 2
#> 566 19470 4.375 0 0 0 1 1 2
#> 567 19472 3.125 0 0 0 1 1 2
#> 568 19547 3.750 0 0 0 1 1 2
#> 569 19586 4.375 0 0 0 1 1 2
#> 570 19623 4.375 1 1 0 0 0 2
#> 571 19633 3.125 1 1 0 0 0 2
#> 572 19650 3.750 0 0 0 1 1 2
#> 573 19698 4.375 1 1 0 0 0 2
#> 574 19726 3.750 1 1 0 0 0 2
#> 575 19760 3.125 0 1 0 0 1 2
#> 576 19812 1.250 0 0 0 1 1 2
#> 577 19881 4.375 0 0 0 1 1 2
#> 578 19906 3.750 0 1 0 0 1 2
#> 579 19941 3.750 1 0 1 0 0 2
#> 580 19967 4.375 1 1 0 0 0 2
#> 581 20018 4.375 1 1 0 0 0 2
# left with 3333 obs after filtering
type_df <- type_df %>%
filter(sum_type <= 1)
Once again, the correlations for recipe types are very low, so we will not include them in the analysis.
type_df %>%
select(-c(ID, sum_type)) %>%
plot_correlation()
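To back up the plot with numbers, the correlation between the rating and each type indicator can also be computed directly. The sketch below is only an illustration: it uses a small simulated data frame (`toy_types`, with hypothetical columns `breakfast`, `dinner` and `dessert`), not the recipe data. On the real data, the same call would presumably be applied to the indicator columns of `type_df`.

```r
# Toy stand-in for type_df: simulated values, NOT the recipe data.
set.seed(42)
n <- 200
type_id <- sample(1:3, n, replace = TRUE)   # exactly one type per row
toy_types <- data.frame(
  rating    = sample(c(2.5, 3.125, 3.75, 4.375, 5), n, replace = TRUE),
  breakfast = as.integer(type_id == 1),
  dinner    = as.integer(type_id == 2),
  dessert   = as.integer(type_id == 3)
)

# Pearson correlation of rating with each 0/1 indicator
# (equivalent to the point-biserial correlation).
type_cols <- setdiff(names(toy_types), "rating")
rating_cor <- sapply(toy_types[type_cols], function(x) cor(toy_types$rating, x))

# Sort by absolute size so the strongest associations come first.
rating_cor[order(abs(rating_cor), decreasing = TRUE)]
```

Values close to zero for every indicator would support the decision to leave the recipe types out.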
### Countries
Only 54 recipes contain information about the country, so we cannot use this variable in the analysis either.
countries_df <- recipes %>%
select(ID, rating, all_of(countries)) %>%
# keep recipes flagged with at least one country
filter(if_any(all_of(countries), ~ .x == 1)) %>%
mutate(sum_type = rowSums(across(all_of(countries))))
countries_df %>%
ggplot(aes(x = sum_type)) +
geom_bar()
Questions
- Should we convert the binary columns to factors, or can we leave them as integers for modelling?
- Should we balance the data, both in the rating_bin case and for the original rating with 7 classes?
- Should we normalise the numerical data?
- It is mentioned in the slides that the validation set should not be balanced, but how do we do that using train() with caret?
- Should we really use balanced data for training? At least for KNN, balancing always makes K = 1 the best choice, whereas K was larger when we trained with unbalanced data. Apparently, balancing the data is not required for KNN.
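On the question of keeping the validation data unbalanced with caret, one possible approach is the `sampling` argument of `trainControl()`. It applies the subsampling inside each resampling iteration, to the training portion only, so the held-out fold keeps its original class distribution. The sketch below runs on simulated data (`toy`, `x1`, `x2` and this `rating_bin` are placeholder names), and the down-sampling, the 5-fold cross-validation and the `k` grid are illustrative choices, not the settings used in this report.

```r
library(caret)

# Simulated, imbalanced two-class data: NOT the recipe data.
set.seed(123)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$rating_bin <- factor(ifelse(toy$x1 + rnorm(n) > 1, "bad", "good"))
table(toy$rating_bin)   # "good" is the majority class

# sampling = "down" down-samples the majority class within each CV training
# split only; the held-out fold used for evaluation stays unbalanced.
ctrl <- trainControl(method = "cv", number = 5, sampling = "down")

knn_fit <- train(
  rating_bin ~ x1 + x2,
  data       = toy,
  method     = "knn",
  preProcess = c("center", "scale"),   # normalise the numeric predictors
  tuneGrid   = data.frame(k = c(1, 5, 11, 21)),
  trControl  = ctrl
)

knn_fit$bestTune   # value of k selected by cross-validated accuracy
```

Refitting with and without `sampling` and comparing `knn_fit$results` is one way to check whether balancing really pushes the selected `k` towards 1.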